12 Dec 2012 wingo   » (Master)

corps: bespoke text codecs

Happy 12/12/12, peoples! In honor of repeated subsequences, today I'm happy to release a new set of compression tools, Corps.

Corps is a toolkit for generating custom text codecs, specialized to particular bodies of text. You give it an example corpus to analyze, and Corps can generate codecs based on what it finds.

For example, if you want to develop a compression algorithm that operates on JavaScript source text, you probably want to use a special code to represent the multi-character sequence function.

I say "probably", because in reality you don't know what substrings are most common. Corps uses the Re-Pair algorithm to build up an optimal token set. This algorithm treats all characters in the input as tokens, and recursively creates composite tokens from the most common pair of adjacent tokens, repeating the process until there are no more repeated pairs of tokens. For full details, see "Off-Line Dictionary-Based Compression" by Larsson and Moffat.

Corps is mostly useful for creating special-purpose fixed codecs based on a dictionary generated by its offline analysis. Although pre-generated codecs often do not compress as well as adaptive codecs like the one used by gzip, fixed codecs are typically faster as they can use the fixed code table to generate optimized encoders and decoders. Corps currently generates encoders in C and Scheme, and decoders in C, Scheme, and JavaScript.

Special-purpose codecs can also provide some interesting properties, such as the ability to decompress a substring of the original text.

get source

Corps is written for Guile version 2.0. See http://gnu.org/s/guile for more information on Guile and how to install it on your system.

To build Corps from git, do:

git clone git://gitorious.org/corps/corps.git
cd corps
./autogen.sh && ./configure && make && make check

You can install using make install, but it's often more convenient to just run corps uninstalled, using the env script.

stinky cheese

Corps includes a simple command-line tool called corps. For full documentation, run corps help. You can run it uninstalled by prefixing the ./env from the build tree.

As an example, say you want to build a database of wikipedia pages on cheese. Let's fetch a page:

$ curl -s 'http://en.wikipedia.org/wiki/Maroilles_(cheese)' > cheese
$ ls -l cheese
-rw-r--r-- 1 wingo wingo 43123 Dec 12 15:12 cheese

Now we analyze it to determine common substrings:

./env corps extract-all cheese > cheese-tokens

This generates a list of (string,frequency) pairs. An extract-all token set is usually quite large. We can pare it down to something manageable, the 500 most common substrings:

./env corps extract-subset cheese-tokens 500 cheese > 500-tokens

With this dictionary, we can huffman-code the page:

$ ./env corps encode -t 500-tokens -m huffman cheese cheese.huff
$ ls -l cheese.huff
-rw-r--r-- 1 wingo wingo 18060 Dec 12 16:09 cheese.huff

Well that's pretty cool -- it's less than half the size of the source text. We can also pare down this set of tokens to an appropriate number of tokens for a bytecode, and try doing a byte-encode of the file:

$ ./env corps extract-subset 500-tokens 254 cheese > 254-tokens
$ ./env corps encode -t 254-tokens cheese > cheese.bc
$ ls -l cheese.bc
-rw-r--r-- 1 wingo wingo 22260 Dec 12 16:19 cheese.bc

It's larger than the huffman-code, not only because of the smaller dictionary, but also because a bytecode is less dense. In practice though a bytecode is good enough while being very fast, so we'll continue with the bytecode.

Now let's generate a C encoder and decoder for this token set.

$ ./env corps generate-c-byte-encoder 254-tokens > encoder.inc.c
$ ./env corps generate-c-byte-decoder 254-tokens > decoder.inc.c
$ cp ~/src/corps/corps/decoder.c .
$ cp ~/src/corps/corps/encoder.c .
$ gcc -o decoder -O3 decoder.c
$ gcc -o encoder -O3 encoder.c
$ ls -l encoder decoder
-rwxr-xr-x 1 wingo wingo 13192 Dec 12 16:23 decoder
-rwxr-xr-x 1 wingo wingo 31048 Dec 12 16:23 encoder

Nice! We could use the corps tool to decode cheese.bc, but to vary things up we can use our zippy C decoder:

$ ./decoder < cheese.bc > cheese.out
$ cmp cheese.out cheese && echo 'excellent!'

The performance of the C encoder is pretty good:

$ for ((x=0;x<1000;x++)) do cat cheese >> megacheese; done
$ time gzip -c < megacheese > megacheese.gz
real	0m1.418s
user	0m1.396s
sys	0m0.016s
$ time ./encoder < megacheese > megacheese.bc
real	0m0.523s
user	0m0.480s
sys	0m0.044s
$ ls -l megacheese*
-rw-r--r-- 1 wingo wingo 43123000 Dec 12 17:03 megacheese
-rw-r--r-- 1 wingo wingo 22370002 Dec 12 17:04 megacheese.bc
-rw-r--r-- 1 wingo wingo 11519311 Dec 12 17:04 megacheese.gz

Gzip gets better compression results for many reasons, but our quick-and-dirty bytecode compressor does well enough and is quite fast. The decoder is also quite good:

$ time ./decoder < megacheese.bc > /dev/null
real	0m0.179s
user	0m0.160s
sys	0m0.016s
$ time gunzip -c < megacheese.gz > /dev/null
real	0m0.294s
user	0m0.284s
sys	0m0.008s

Amusingly, for this text, gzipping the bytecoded file has quite an impact:

$ gzip -c < megacheese.bc > megacheese.bc.gz
$ ls -l megacheese*
-rw-r--r-- 1 wingo wingo 43123000 Dec 12 17:03 megacheese
-rw-r--r-- 1 wingo wingo 22370002 Dec 12 17:04 megacheese.bc
-rw-r--r-- 1 wingo wingo   175246 Dec 12 17:12 megacheese.bc.gz
-rw-r--r-- 1 wingo wingo 11519311 Dec 12 17:04 megacheese.gz

It so happens that byte-compressing the original text allows it to fit within the default gzip "window size" of 32 KB, letting gzip detect the thousandfold duplication of the source text. As a codec that works on bytes, gzip tends to work quite well on bytecoded files, and poorly on bit-coding schemes like huffman codes. A gzipped bytecoded file is usually smaller than a gzipped raw file and smaller than a gzipped huffman-coded file.


I'll close with a link to the 254 most common substrings in one corpus of JavaScript code that I analyzed: here. I have a set of patches to integrate a codec optimized for JavaScript source into V8, to try to reduce the memory footprint of JS source code. More on that quixotic endeavor in some future post. Until then, happy scheming!

Syndicated 2012-12-12 16:57:56 from wingolog

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!