unrpa_rs - A command line utility & library to extract RenPy archives

In this post I want to introduce unrpa_rs, a command line utility and library to extract RenPy archives (RPAs), written in the Rust programming language. It can be used to extract various assets that have been bundled in the RPA format. Currently RPAv3.2, RPAv3, and RPAv2 are supported.

Motivation

For a long time I was interested in building something with Rust; however, up until now I never found the time to actually do it. After finishing my bachelor's thesis last semester, I finally had enough time to start this little side project and get better acquainted with Rust.

To start at the right level of difficulty, I found a repo on GitHub called rpatool, a Python tool to create, modify, and extract RenPy archive files. RenPy is an open source Python game engine primarily used to create visual novels. Thus, with the basic functionality already known, I decided to implement the extraction functionality in Rust.

CLI Usage

USAGE:
    unrpa_rs [FLAGS] <INPUT>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information
    -v, --verbose    Increase verbosity level (-v, -vv, -vvv, etc.)

ARGS:
    <INPUT>    The path to the archive file to read from
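
For example, extracting an archive with increased verbosity could look like this (the archive path is just an illustration):

    unrpa_rs -vv path/to/archive.rpa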

Disclaimer

Use this tool only on archives whose authors allow modification or extraction. Unauthorized use is highly discouraged, since it most likely constitutes a license violation.

How it works

After reading the Python source code and grasping the underlying idea, I found the involved steps to be pretty straightforward. I will describe them briefly in the order they occur, with a snippet of the source code for the important bits.

Opening the file descriptor and extracting metadata

The first step consists of opening a file descriptor to get access to the referenced file and extracting some metadata about the archive. The metadata sits right in the first line and is separated by spaces. The first string is the magic literal, which denotes the version of the RPA format, e.g. "RPA-3.2", "RPA-3.0", or "RPA-2.0". The remaining data is the byte offset we need to jump to in order to extract the indices, i.e. the files present in the archive, and the obfuscation key we need to deobfuscate the index data. This key is constructed by successively XOR-ing the listed sub-keys together.

fn construct_obfuscation_key<S: AsRef<str>>(
    rpa_version: &RpaVersion,
    metadata: &[S],
) -> IntLen {
    // XOR all hex-encoded sub-keys from the metadata line together;
    // RPAv2 archives are not obfuscated, so the key is simply 0.
    match *rpa_version {
        RpaVersion::V3 => metadata.as_ref()[2..]
            .iter()
            .fold(0, |acc: IntLen, sub_key| {
                acc ^ IntLen::from_str_radix(sub_key.as_ref(), 16).unwrap()
            }),
        RpaVersion::V3_2 => metadata.as_ref()[3..]
            .iter()
            .fold(0, |acc: IntLen, sub_key| {
                acc ^ IntLen::from_str_radix(sub_key.as_ref(), 16).unwrap()
            }),
        RpaVersion::V2 => 0,
    }
}

Extracting the indices

The next step consists of jumping to the offset, reading all bytes until EOF, and running zlib decompression to get access to the decompressed byte buffer. Afterwards, we need to deserialize all indices from the Python pickle format.

// seek cursor to the decoded offset
reader.seek(SeekFrom::Start(offset))?;

let mut bytes: Vec<u8> = Vec::new();
// read everything until EOF
let bytes_read = reader.read_to_end(&mut bytes)?;
let mut decoded_bytes: Vec<u8> = Vec::with_capacity(2 * bytes_read);

// read the content by decoding it with zlib
ZlibDecoder::new(&bytes[..]).read_to_end(&mut decoded_bytes)?;
let deserialized_indices: RpaIdx = serde_pickle::from_slice(&decoded_bytes)?;

For the zlib decompression I used the flate2 crate, and for the pickle handling serde in combination with serde-pickle. Now we have a list of indices with fields like offset and len. However, for RPAv3 and RPAv3.2 we need to deobfuscate these fields by once more applying XOR with the previously derived obfuscation key.
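
As a minimal sketch of that deobfuscation step (the function and field names here are illustrative and u64 stands in for the crate's IntLen alias; this is not the actual unrpa_rs code):

// Undo the RPAv3/RPAv3.2 index obfuscation: both fields are obfuscated
// by XOR-ing them with the key, so XOR-ing again restores the originals.
fn deobfuscate_index(offset: u64, len: u64, key: u64) -> (u64, u64) {
    (offset ^ key, len ^ key)
}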

Reading the byte buffer of the indices

With the previous step complete, we now have all the metadata needed to read a byte buffer for every index into memory. We only need to jump to the individual offset of the index and read exactly len bytes. However, since RPAv3.2 has a prefix field, we need to encode it with latin1 (also known as ISO-8859-1) and subtract the length of the raw prefix from the original len. In this case the byte buffer is appended to the encoded prefix and then returned. However, all RPAv3.2 archives I have seen up to this point have an empty prefix, which also yields an empty encoded prefix, so this portion of the code is rather untested.

self.reader.seek(SeekFrom::Start(offset))?;

let desired_capacity = len as usize - prefix.unwrap_or("").len();
let mut encoded_prefix = ISO_8859_1.encode(prefix.unwrap_or(""), EncoderTrap::Strict)?;

let mut buf = vec![0u8; desired_capacity];
// now read exactly `desired_capacity` bytes
self.reader.read_exact(&mut buf)?;
assert_eq!(desired_capacity, buf.len());

// append the byte buffer to the encoded prefix if the prefix is not empty
if self.version == RpaVersion::V3 && !encoded_prefix.is_empty() {
    encoded_prefix.append(&mut buf);
    Ok(encoded_prefix)
} else {
    Ok(buf)
}

For the latin1 encoding process I used the encoding crate. Since we now have the raw byte buffer, we can write it directly to disk. This is rather uninteresting, so I am not listing it here.

Multithreading?

My original plan was to speed up the process of reading all indices by reading the byte buffers in parallel. However, I soon realized this is not possible, since access to the underlying file resource cannot be shared by multiple threads. Because I am not simply reading the file sequentially, but rather need to jump to an individual byte offset for every index, I decided to use the memmap crate to get a file-backed immutable memory map in the form of a Cursor I can use to jump around in the file and perform normal read operations.
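
A minimal sketch of that idea, assuming the memmap crate's Mmap type (the actual setup in unrpa_rs may differ):

use memmap::Mmap;
use std::fs::File;
use std::io::{Cursor, Read, Seek, SeekFrom};

// Read `len` bytes starting at `offset` through a file-backed memory map.
fn read_at(path: &str, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let file = File::open(path)?;
    // Safety: the file must not be modified by others while the map is alive.
    let mmap = unsafe { Mmap::map(&file)? };
    // The map derefs to a byte slice, so wrapping it in a Cursor gives us
    // normal seek/read semantics without touching the file descriptor again.
    let mut cursor = Cursor::new(&mmap[..]);
    cursor.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    cursor.read_exact(&mut buf)?;
    Ok(buf)
}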

Update 2022-02-18

Since version 0.4.0, unrpa_rs uses multithreading on a per-file basis. This means that if you want to extract more than one archive at the same time, the crate will extract them in parallel on a thread pool, powered by the amazing rayon crate.
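
As a rough sketch of how such per-file parallelism looks with rayon (extract_archive here is a hypothetical stand-in for the crate's actual extraction routine):

use rayon::prelude::*;
use std::path::{Path, PathBuf};

// Hypothetical stand-in for the per-archive extraction entry point.
fn extract_archive(path: &Path) -> std::io::Result<()> {
    println!("extracting {}", path.display());
    Ok(())
}

fn extract_all(paths: &[PathBuf]) {
    // par_iter() distributes the archives over rayon's global thread pool,
    // so each archive is extracted on its own worker thread.
    paths.par_iter().for_each(|path| {
        if let Err(e) = extract_archive(path) {
            eprintln!("failed to extract {}: {}", path.display(), e);
        }
    });
}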

Benchmarks performed with the criterion crate showed a significant performance gain with this change. You can check them out and run them yourself: check out the delete-io-systemcall tag in the repo and run the benchmarks with cargo bench, then check out v0.3.0 and run them again to see the difference in the report. Since this is an I/O-intensive project, fast disks such as SSDs mean better performance. Also, Linux and macOS generally offer better performance, since I/O on Windows is more expensive.
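
For reference, a criterion benchmark has roughly the following shape (this is just an illustration of the setup, not the actual benchmark code from the repo):

use criterion::{criterion_group, criterion_main, Criterion};

fn extraction_benchmark(c: &mut Criterion) {
    c.bench_function("extract sample archive", |b| {
        b.iter(|| {
            // hypothetical: call the extraction routine on a small
            // fixture archive here and measure each iteration
        })
    });
}

criterion_group!(benches, extraction_benchmark);
criterion_main!(benches);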

Feedback welcome

Since this is my first Rust project, I am open to suggestions from the community if they know a better way to do certain things. The GitLab repository would be the best place to go for that.