An ode to pack: gzip’s forgotten decompressor

The latest 4.13.9 source release of the Linux kernel is 780 MiB, but thanks to xz compression, the download is a much more manageable 96 MiB (an 88% reduction).

Before xz took over as the default compression format on kernel.org in 2013, following the "latest" link would have gotten you a bzip2 compressed file. The tar.bz2 would have been 115 MiB (-85%), but there was no defending the extra 20 MiB after xz caught up in popularity. bzip2 is all but displaced today.

bzip2 became the default in 2003, though it had long been an option over the less efficient gzip. However, since every OS, browser, language core library, phone and IoT lightswitch has built-in support for gzip, a 148 MiB (-81%) tar.gz remains an option even today.

gzip itself started taking over in 1994, before kernel.org, and before the World Wide Web went mainstream. It must have been a particularly easy sell for the fledgling Linux kernel: it was made, used and endorsed by the mighty GNU project, it was Free Software, free of patent restrictions, and it provided powerful .zip-style DEFLATE compression in a Unix-friendly package.

Another nice benefit was that gzip could decompress other contemporary formats, thereby replacing contested and proprietary software.

Among the tools it could replace was compress, the de facto Unix standard at the time. Built on LZW in 1985, it was hampered by the same patent woes that plagued GIF files. The then-ubiquitous .Z suffix graced the first public Linux releases, but is now recognized only by the most long-bearded enthusiasts. The current release would have been 302 MiB (-61%) with compress.

Another, even more obscure tool it could replace was compress’s own predecessor, pack. This rather loosely defined collection of only partially compatible formats is why compress had to use a capital Z in its extension: pack came first, and offered straight Huffman coding with a .z extension.

With pack, our Linux release would have been 548 MiB (-30%). Compared to xz’s 96 MiB, it’s obvious why no one has used it for decades.

Well, guess what: gzip never ended its support! Quoth the man page,

gunzip can currently decompress files created by gzip, zip,
compress, compress -H or pack.

While multiple implementations existed, they shared some common peculiarities:

  • They could not be used in pipes.
  • They could not represent empty files.
  • They could not compress a file with only one distinct byte value, e.g. "aaaaaa…".
  • They could fail on "large" files. "can’t occur unless [file size] >= [16MB]", a comment said dismissively, from a time when a 10 MB hard drive was a luxury few could afford.

These issues stemmed directly from the Huffman coding used. Huffman coding, developed in 1952, is basically an improvement on Morse code, where common characters like "e" get a short code like "011", while uncommon "z" gets a longer one like "111010".

  • Since you have to count the characters to figure out which are common, you cannot compress in a single pass in a pipe. Now that memory is cheap, you can mostly get around this by keeping all the data in RAM.

  • Empty files and single-valued files hit an edge case: if you only have a single value, the shortest code for it is the empty string. Decompressors that didn’t account for this would get stuck reading 0 bits forever. You can get around it by adding unused dummy symbols to ensure a minimum bit length of 1 (see the sketch after this list).

  • A file over 16 MB could cause a single character to be so rare that its bit code was 25+ bits long. A decompressor storing the bits to be decoded in a 32-bit value (a trick even gzip uses) would be unable to append a new 8-bit byte to the buffer without displacing part of the current bit code. You can get around that by using "package-merge" length-limited prefix codes instead of naive Huffman codes.
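
To make the fixes concrete, here’s a minimal Haskell sketch of a Huffman builder with the dummy-symbol workaround applied. It’s an illustration, not pack’s actual code: the names and dummy values are arbitrary, and the package-merge length restriction is left out.

import Data.List (insertBy, sortOn)
import Data.Ord (comparing)

data Tree = Leaf Int Char | Node Int Tree Tree

weight :: Tree -> Int
weight (Leaf w _)   = w
weight (Node w _ _) = w

-- Classic Huffman: repeatedly merge the two lightest trees. The
-- zero-weight dummies guarantee at least two leaves, which fixes the
-- empty-file and single-value cases (dummy values chosen arbitrarily).
buildTree :: [(Char, Int)] -> Tree
buildTree freqs = merge (sortOn weight [Leaf w c | (c, w) <- padded])
  where
    padded = take (max 0 (2 - length freqs)) [('\0', 0), ('\1', 0)] ++ freqs
    merge [t]      = t
    merge (a:b:ts) = merge (insertBy (comparing weight)
                             (Node (weight a + weight b) a b) ts)

-- Read the codes off the tree: 0 for a left branch, 1 for a right branch.
codes :: Tree -> [(Char, String)]
codes tree = walk tree ""
  where
    walk (Leaf _ c)   path = [(c, reverse path)]
    walk (Node _ l r) path = walk l ('0' : path) ++ walk r ('1' : path)

With the padding in place, even an empty input or a pure "aaaa…" file yields a two-leaf tree, so every symbol gets a code of at least one bit.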

I wrote a Haskell implementation with all these fixes in place: koalaman/pack is available on GitHub.

During development, I found that pack support in gzip had been buggy since 2012 (version 1.6), but no one had noticed in the five years since. I tracked down the problem, and I’m happy to say that version 1.9 will once again ship with full pack support!

Anyways, what could possibly be the point of using pack today?

There is actually one modern use case: code golfing.

This post came about because I was trying to implement the shortest possible program that would output a piece of simple ASCII art. A common trick is variations of a self-extracting shell script:

sed 1d $0|gunzip;exit
<compressed binary data here>

You can use any available compressor, including xz and bzip2, but these were meant for bigger files and have game-ruining overheads. Here’s the result of compressing the ASCII art in question:

  • raw: 269 bytes
  • xz: 216 bytes
  • bzip2: 183 bytes
  • gzip: 163 bytes
  • compress: 165 bytes
  • and finally, pack: 148 bytes!

I was able to save 15 bytes by leveraging gzip’s forgotten legacy support. This is huge in a sport where winning entries are bytes apart.

Let’s have a look at this simple file format. Here’s an example pack file header for the word "banana":

1f 1e        -- Two byte magic header
00 00 00 06  -- Original uncompressed length (6 bytes)

Next comes the Huffman tree. Building it is simple to do by hand, but too much for this post. It just needs to be complete, left-aligned, with eof on the right at the deepest level. Here’s the optimal tree for this string:

        /\
       /  a
      /\
     /  \
    /\   n
   b  eof

We start by encoding its depth (3), followed by the number of leaves on each level. The count for the last level is stored minus 2, because the deepest level will have between 2 and 257 leaves, while a byte can only store 0-255.

03  -- depth
01  -- level 1 only contains 'a'
01  -- level 2 only contains 'n'
00  -- level 3 contains 'b' and 'eof', -2 as mentioned

Next we encode the ASCII values of the leaves, ordered from top to bottom, left to right. We can leave off the eof (which is why it needs to be in the lower right):

61 6e 62  -- "a", "n", "b"

This is enough for the decompressor to rebuild the tree.
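
In fact, the layout rule is rigid enough that codes can be assigned level by level: expand the previous level’s internal nodes into twice as many slots, give the lowest slots to this level’s internal nodes, and hand the highest ones to the leaves in listed order. Here’s that reading of the rule as a Haskell sketch (codesFromHeader is an invented name; remember to add 2 back to the stored last-level count):

-- Rebuild the code table from the header: one leaf count per level,
-- plus the leaf symbols in file order (with eof appended last).
codesFromHeader :: [Int] -> [a] -> [(a, String)]
codesFromHeader counts syms = zip syms (go [""] counts)
  where
    go _ [] = []
    go nodes (n : rest) =
        let slots    = concatMap (\c -> [c ++ "0", c ++ "1"]) nodes
            internal = take (length slots - n) slots  -- lowest codes descend further
            leaves   = drop (length slots - n) slots  -- highest codes become leaves
        in leaves ++ go internal rest

For our example, codesFromHeader [1, 1, 2] ["a", "n", "b", "eof"] yields [("a","1"), ("n","01"), ("b","000"), ("eof","001")], matching the tree above.

Now we go on to encode the actual data.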

Starting from the root, the Huffman codes are determined by adding a 0 for every left branch and a 1 for every right branch you have to take to get to your value:

a   -> right = 1
n   -> left+right = 01
b   -> left+left+left = 000
eof -> left+left+right = 001

banana<eof> would therefore be 000 1 01 1 01 1 001, or when grouped as bytes:

16  -- 0001 0110
C8  -- 1100 1   (000 as padding)

And that’s all we need:

$ printf '\x1f\x1e\x00\x00\x00\x06'\
'\x03\x01\x01\x00\x61\x6e\x62\x16\xc8' | gzip -d
banana
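
By the way, grouping bits into bytes by hand gets old quickly. A throwaway Haskell helper can do the packing; packBits is just an illustrative name:

import Data.Bits (shiftL, (.|.))
import Data.List (unfoldr)
import Data.Word (Word8)

-- Pack a string of '0'/'1' characters into bytes, most significant
-- bit first, padding the final byte with zeroes.
packBits :: String -> [Word8]
packBits = unfoldr chunk
  where
    chunk [] = Nothing
    chunk bs = Just (toByte (take 8 (bs ++ repeat '0')), drop 8 bs)
    toByte   = foldl (\acc c -> acc `shiftL` 1 .|. (if c == '1' then 1 else 0)) 0

For the string above, packBits "0001011011001" gives [22,200], i.e. 0x16 and 0xC8.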

Unfortunately, the aforementioned gzip bug triggers on this file, due to a failure to account for leading zeroes in the bit codes. eof and a have codes 001 and 1, so an oversimplified equality check confuses one for the other, causing gzip to terminate early:

b
gzip: stdin: invalid compressed data--length error

However, if you’re stuck with an affected version, there’s another party trick you can do: the Huffman tree has to be canonical, but it does not have to be optimal!

What would happen if we skipped the count and instead claimed that each ASCII character is equally likely? Why, we’d get a tree of depth 8 where all the leaf nodes are on the deepest level.

It then follows that each 8 bit character will be encoded as 8 bits in the output file, with the bit patterns we choose by ordering the leaves.

Let’s add a header with a dummy length to a file:

$ printf '\x1F\x1E____' > myfile.z

Now let’s append the aforementioned tree structure, 8 levels with all nodes in the last one:

$ printf '\x08\0\0\0\0\0\0\0\xFE' >> myfile.z

And let’s populate the leaf nodes with 255 bytes in an order of our choice:

$ printf "$(printf '\\%o' {0..254})" |
    tr 'A-Za-z' 'N-ZA-Mn-za-m' >> myfile.z

Now we can run the following command, enter some text, and hit Ctrl-D to "decompress" it:

$ cat myfile.z - | gzip -d 2> /dev/null
Jr unir whfg pbaivaprq TMvc gb hafpenzoyr EBG13!
<Ctrl+D>
We have just convinced GZip to unscramble ROT13!

Can you think of any other fun ways to use or abuse gzip’s legacy support? Post a comment.

Parameterized Color Cell Compression

I came across a quaint and adorable paper from SIGGRAPH’86: Two bit/pixel Full Color Encoding. It describes Color Cell Compression, an early ancestor of Adaptive Scalable Texture Compression which is all the rage these days.

Like ASTC, it offers a fixed 2 bits/pixel encoding for color images. However, the first of many d’awwws in this paper comes as early as the second line of the abstract, when it suggests that a fixed rate is useful not for the random access we covet for rendering today, but simply for doing local image updates!

The algorithm can compress a 640×480 image in just 11 seconds on a 3 MHz VAX 11/750, and decompress it basically in real time. This means that it may allow full color video, unlike these impractical, newfangled transform-based algorithms people are researching.

CCC actually works astonishingly well. Here’s our politically correct Lenna substitute:

mandrill

The left half of the image is 24 bpp, while the right is 2 bpp. Really the only way to tell is in the eyes, and I’m sure there’s an interesting evolutionary explanation for that.

If we zoom in, we can get a sense of what’s going on:

mandrill_eye

The image is divided into 4×4 cells, and each cell is composed of only two different colors. In other words, the image is composed of 4×4 bitmaps with a two-color palette, each chosen from an image-wide 8-bit palette. A 4×4 bitmap would take up 16 bits, and two 8-bit palette indices would take up 16 bits, for a total of 32 bits per 16 pixels, or 2 bits per pixel.

The pixels in each cell are divided into two groups based on luminosity, and each group gets its own color based on the average color in the group. One of the reasons this works really well, the author says, is because video is always filmed so that a change in chromaticity has an associated change in luminosity — otherwise on-screen features would be invisible to the folks at home who still have black-and-white TVs!

We now know enough to implement this lovely algorithm: find an 8-bit palette covering the image, then for each cell of 4×4 pixels, divide the pixels into two groups based on whether their luminosity is over or under the cell average. Find the average color of each group, and find its closest match in the palette.
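
Here’s a sketch of that per-cell step in Haskell. It’s my own illustration rather than the paper’s code: encodeCell is a made-up name, the Rec. 601 luma weights are an assumption, and the palette lookup is left out.

type RGB = (Double, Double, Double)

-- Rec. 601 luma weights, a common choice; the paper's exact
-- weighting may differ.
luma :: RGB -> Double
luma (r, g, b) = 0.299 * r + 0.587 * g + 0.114 * b

-- Encode one cell: a bitmask selecting the brighter group, plus the
-- average color of each group (the palette matching is omitted).
encodeCell :: [RGB] -> ([Bool], RGB, RGB)
encodeCell cell = (mask, average dark, average bright)
  where
    cut    = sum (map luma cell) / fromIntegral (length cell)
    mask   = [luma p >= cut | p <- cell]
    bright = [p | p <- cell, luma p >= cut]
    dark   = [p | p <- cell, luma p <  cut]
    average []  = (0, 0, 0)  -- degenerate: the whole cell fell in one group
    average ps  = let n            = fromIntegral (length ps)
                      (rs, gs, bs) = unzip3 ps
                  in (sum rs / n, sum gs / n, sum bs / n)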

However, let’s experiment! Why limit ourselves to 4×4 cells with 2 colors each from a palette of 256? What would happen if we used 8×8 cells with 3 colors each from a palette of 512? That also comes out to around 2 bpp: 64 pixels at log2(3) ≈ 1.6 bits each is about 101 bits of bitmap, plus three 9-bit palette indices, for roughly 128 bits per 64 pixels.

Parameterizing palette and cell size is easy, but how do we group pixels into k colors based on luminosity? Simple: instead of using the mean, we use k-means!
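
One Lloyd iteration over the cell’s luminosities is all it takes. A sketch, with kmeansStep as an invented name; a real implementation would iterate until the means stop moving:

-- One k-means (Lloyd) step in 1-D: assign each luminosity to its
-- nearest mean, then recompute each mean from its cluster.
kmeansStep :: [Double] -> [Double] -> [Double]
kmeansStep means xs = [mean [x | x <- xs, nearest x == i] | i <- indices]
  where
    indices   = [0 .. length means - 1]
    nearest x = snd (minimum [(abs (x - m), i) | (m, i) <- zip means indices])
    mean []   = 0  -- an emptied cluster; a real implementation would reseed it
    mean ys   = sum ys / fromIntegral (length ys)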

Here’s a colorful parrot in original truecolor on the left, CCC with 4×4 cells in the middle, and 32×32 cells (1.01 bpp) on the right. Popartsy!

ara3

Here’s what we get if we only allow eight colors per horizontal line. The color averaging effect is really pronounced:

ara4

And here’s 3 colors per 90×90 cell:
ara6

The best part about this paper is the discussion of applications. For 24fps video interlaced at 320×480, they say, you would need a transfer rate of 470 kb/s. Current microcomputers have a transfer rate of 625 kb/s, so this is well within the realm of possibility. Today’s standard 30 megabyte hard drives could therefore store around 60 seconds of animation!

Apart from the obvious benefits of digital video like no copy degradation and multiple resolutions, you can save space when panning a scene by simply transmitting the edge in the direction of the pan!

You can even use CCC for electronic shopping. Since the images are so small and decoding so simple, you can make cheap terminals in great quantities, transmit images from a central location and provide accompanying audio commentary via cable!

In addition to one-to-many applications, you can have one-to-one electronic, image based communication. In just one minute on a 9600bps phone connection, a graphic arts shop can transmit a 640×480 image to clients for approval and comment.

You can even do many-to-many teleconferencing! Imagine the ability to show the speaker’s face, or a drawing they wish to present to the group on simple consumer hardware!

Truly, the future is amazing.


Here’s the JuicyPixels-based Haskell implementation I used. It doesn’t actually encode the image; it just simulates the degradation. Ironically, this version is slower than the authors’ original, even though the hardware is five or six orders of magnitude faster!

Apart from the parameterization, I added two other improvements: first, instead of the naive RGB-based averaging suggested in the paper, it uses the YCrCb average. Second, instead of choosing the palette from the original image, it chooses it from the averages. This doesn’t matter much for colorful photographs, but gives better results for images lacking gradients.

Technically correct: Inverted regex

How do I write a regex that matches lines that do not contain hi?

That’s a frequently asked question if I ever saw one.

Of course, the proper answer is: you don’t. You write a regex that does match hi and then invert the matching logic, typically with grep -v. But where’s the fun in that?

One interesting theorem that pops up in any book or class on formal grammars is that regular languages are closed under complement: the inverse of a regular expression is also a regular expression. This means that writing inverted regular expressions is theoretically possible, though it turns out to be quite tricky.

Just try writing a regex that matches strings that do not contain “hi”, and test it against “hi”, “hhi”, “ih”, “iih” and similar variations. Some solutions are coming up.

A way to cheat is using PCRE negative lookahead: ^(?!.*foo) matches all strings not containing the substring “foo”. However, lookahead assertions require a stack, and thus can’t be modelled as a finite state machine. In other words, they don’t fit the mathematical definition of a regular expression, and are therefore disqualified.

There are simple, well-known algorithms for turning regular expressions into non-deterministic finite automata, and from there into deterministic ones. Less commonly known are the algorithms for inverting a DFA and for generating a familiar textual regex from it.

You can find these described in various lecture notes and slides, so I won’t recite them.
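
The inversion step, at least, is refreshingly short: given a complete DFA (one with a total transition function), you keep the states and transitions and simply flip which states accept. A minimal Haskell sketch; the types and names are invented for illustration, and building the complete DFA in the first place is the real work:

import qualified Data.Set as Set

-- A DFA whose transition function is total: every state handles every
-- symbol, using an explicit dead state if necessary.
data DFA s a = DFA
    { states    :: Set.Set s
    , step      :: s -> a -> s
    , start     :: s
    , accepting :: Set.Set s
    }

-- Inverting a complete DFA: keep everything, flip acceptance.
complement :: Ord s => DFA s a -> DFA s a
complement d = d { accepting = states d `Set.difference` accepting d }

-- Run the machine and check whether it ends in an accepting state.
accepts :: Ord s => DFA s a -> [a] -> Bool
accepts d = (`Set.member` accepting d) . foldl (step d) (start d)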

What I had a harder time finding was software that actually did this. So here is a Haskell program. It’s highly suboptimal, but it does the job. When executed, it will ask for a regex and will then output a grep command that matches everything the regex does not (without -v, obviously).

The expressions it produces are quite horrific; it’s computer generated code, after all.

A regular expression for matching strings that do not match .*hi.* could be grep -E '^([^h]|h+$|h+[^hi])*$'.

This app, however, suggests grep -E '^([^h]([^h]|)*||([^h][^h]*h|h)|([^h][^h]*(h(hh*[^hi]|[^hi]))|(h(hh*[^hi]|[^hi])))((hh*[^hi]|[^h])|)*|([^h][^h]*(h([^hi][^h]*h|h))|(h([^hi][^h]*h|h)))(([^hi][^h]*h|h)|)*)$'

It still works exactly as stated though!

The app supports only a small subset of regex, just enough to convince someone that it works and, as a party trick, to let you answer the original question exactly as stated.

Technically correct is the best kind of correct.