An ode to pack: gzip’s forgotten decompressor

The latest 4.13.9 source release of the Linux kernel is 780MiB, but thanks to xz compression, the download is a much more managable 96 MiB (an 88% reduction)

Before xz took over as the default compression format on in 2013, following the "latest" link would have gotten you a bzip2 compressed file. The tar.bz2 would have been 115 MiB (-85%), but there’s was no defending the extra 20 MiB after xz caught up in popularity. bzip2 is all but displaced today.

bzip2 became the default in 2003, though it had long been an option over the less efficient gzip. However, since every OS, browser, language core library, phone and IoT lightswitch has built-in support for gzip, a 148 MiB (-81%) tar.gz remains an option even today.

gzip itself started taking over in 1994, before, and before the World Wide Web went mainstream. It must have been a particularly easy sell for the fledgeling Linux kernel: it was made, used and endorsed by the mighty GNU project, it was Free Software, free of patent restrictions, and it provided powerful .zip style DEFLATE compression in a Unix friendly package.

Another nice benefit was that gzip could decompress other contemporary formats, thereby replacing contested and proprietary software.

Among the tools it could replace was compress, the de-facto Unix standard at the time. Created based on LZW in 1985, it was hampered by the same patent woes that plagued GIF files. The then-ubiquitous .Z suffix graced the first public Linux releases, but is now recognized only by the most long-bearded enthusiasts. The current release would have been 302 MiB (-61%) with compress.

Another even more obscure tool it could replace was compress‘s own predecessor, pack. This rather loosely defined collection of only partially compatible formats is why compress had to use a capital Z in its extension. pack came first, and offered straight Huffman coding with a .z extension.

With pack, our Linux release would have been 548 MiB (-30%). Compared to xz‘s 96 MiB, it’s obvious why no one has used it for decades.

Well, guess what: gzip never ended its support! Quoth the man page,

gunzip can currently decompress files created by gzip, zip,
compress, compress -H or pack.

While multiple implementations existed, these were common peculiarities:

  • They could not be used in pipes.
  • They could not represent empty files.
  • They could not compress a file with only one byte value, e.g. "aaaaaa…"
  • They could fail on "large" files. "can’t occur unless [file size] >= [16MB]", a comment said dismissively, from the time when a 10MB hard drive was a luxury few could afford.

These issues stemmed directly from the Huffman coding used. Huffman coding, developed in 1952, is basically an improvement on Morse code, where common characters like "e" get a short code like "011", while uncommon "z" gets a longer one like "111010".

  • Since you have to count the characters to figure out which are common, you can not compress in a single pass in a pipe. Now that memory is cheap, you could mostly get around that by keeping the data in RAM.

  • Empty files and single-valued files hit an edge case: if you only have a single value, the shortest code for it is the empty string. Decompressors that didn’t account for it would get stuck reading 0 bits forever. You can get around it by adding unused dummy symbols to ensure a minimum bit length of 1.

  • A file over 16MB could cause a single character to be so rare that its bit code was 25+ bits long. A decompressor storing the bits to be decoded in a 32bit value (a trick even gzip uses) would be unable to append a new 8bit byte to the buffer without displacing part of the current bit code. You can get around that by using "package merge" length restricted prefix codes over naive Huffman codes.

I wrote a Haskell implementation with all these fixes in place: koalaman/pack is available on GitHub.

During development, I found that pack support in gzip had been buggy since 2012 (version 1.6), but no one had noticed in the five years since. I tracked down the problem and I’m happy to say that version 1.9 will again restore full pack support!

Anyways, what could possibly be the point of using pack today?

There is actually one modern use case: code golfing.

This post came about because I was trying to implement the shortest possible program that would output a piece of simple ASCII art. A common trick is variations of a self-extracting shell script:

sed 1d $0|gunzip;exit
<compressed binary data here>

You can use any available compressor, including xz and bzip2, but these were meant for bigger files and have game ruining overheads. Here’s the result of compressing the ASCII art in question:

  • raw: 269 bytes
  • xz: 216 bytes
  • bzip2: 183 bytes
  • gzip: 163 bytes
  • compress: 165 bytes
  • and finally, pack: 148 bytes!

I was able to save 15 bytes by leveraging gzip‘s forgotten legacy support. This is huge in a sport where winning entries are bytes apart.

Let’s have a look at this simple file format. Here’s an example pack file header for the word "banana":

1f 1e        -- Two byte magic header
00 00 00 06  -- Original compressed length (6 bytes)

Next comes the Huffman tree. Building it is simple to do by hand, but too much for this post. It just needs to be complete, left-aligned, with eof on the right at the deepest level. Here’s the optimal tree for this string:

       /  a
     /  \
    /\   n
   b  eof

We start by encoding its depth (3), and the number of leaves on each level. The last level is encoded minus 2, because the lowest level will have between 2 and 257 leaves, while a byte can only store 0-255.

03  -- depth
01  -- level 1 only contains 'a'
01  -- level 2 only contains 'n'
00  -- level 3 contains 'b' and 'eof', -2 as mentioned

Next we encode the ASCII values of the leaves in the order from top to bottom, left to right. We can leave off the EOF (which is why it needs to be in the lower right):

61 6e 62  -- "a", "n" ,"b"

This is enough for the decompressor to rebuild the tree. Now we go on to encode the actual data.

Starting from the root, the Huffman codes are determined by adding a 0 for ever left branch and 1 for every right branch you have to take to get to your value:

a   -> right = 1
n   -> left+right = 01
b   -> left+left+left -> 000
eof -> left+left+right -> 001

banana<eof> would therefore be 000 1 01 1 01 1 001, or when grouped as bytes:

16  -- 0001 0110
C8  -- 1100 1   (000 as padding)

And that’s all we need:

$ printf '\x1f\x1e\x00\x00\x00\x06'\
'\x03\x01\x01\x00\x61\x6e\x62\x16\xc8' | gzip -d

Unfortunately, the mentioned gzip bug triggers due to failing to account for leading zeroes in bit code. eof and a have values 001 and 1, so an oversimplified equality check confuses one for the other, causing gzip to terminate early:

gzip: stdin: invalid compressed data--length error

However, if you’re stuck with an affected version, there’s another party trick you can do: the Huffman tree has to be canonical, but it does not have to be optimal!

What would happen if we skipped the count and instead claimed that each ASCII character is equally likely? Why, we’d get a tree of depth 8 where all the leaf nodes are on the deepest level.

It then follows that each 8 bit character will be encoded as 8 bits in the output file, with the bit patterns we choose by ordering the leaves.

Let’s add a header with a dummy length to a file:

$ printf '\x1F\x1E____' > myfile.z

Now let’s append the afforementioned tree structure, 8 levels with all nodes in the last one:

$ printf '\x08\0\0\0\0\0\0\0\xFE' >> myfile.z

And let’s populate the leaf nodes with 255 bytes in an order of our choice:

$ printf "$(printf '\\%o' {0..254})" |
    tr 'A-Za-z' 'N-ZA-Mn-za-m' >> myfile.z

Now we can run the following command, enter some text, and hit Ctrl-D to "decompress" it:

$ cat myfile.z - | gzip -d 2> /dev/null
Jr unir whfg pbaivaprq TMvc gb hafpenzoyr EBG13!
We have just convinced GZip to unscramble ROT13!

Can you think of any other fun ways to use or abuse gzip‘s legacy support? Post a comment.

Buffers and windows: The mystery of ‘ssh’ and ‘while read’ in excessive detail

If you’ve ever tried to use ssh, and similarly ffmpeg or mplayer, in a while read loop, you’ll have stumbled upon a surprising interaction: the loop mysteriously aborts after the first iteration!

The solution, using ssh -n or ssh < /dev/null, is quickly spoiled by ShellCheck (shellcheck this code online), but why stop there? Let’s take a deep dive into the technical details surrounding this issue.

Note that all numbers given here are highly tool and platform specific. They apply to GNU coreutils 8.26, Linux 4.9 and OpenSSH 7.4, as found on Debian in July, 2017. If you use a different platform, or even just sufficiently newer versions, and wish to repeat the experiments, you may have to follow the reasoning and update the numbers accordingly.

Anyways, to first demonstrate the problem, here’s a while read loop that runs a command for each line in a file:

while IFS= read -r host
  echo ssh "$host" uptime
done < hostlist.txt

It works perfectly and iterates over all lines in the file:

ssh localhost uptime
ssh uptime
ssh uptime

However, if we remove the echo and actually run ssh, it will stop after the first iteration with no warnings or errors:

 16:12:41 up 21 days,  4:24, 12 users,  load average: 0.00, 0.00, 0.00

Even uptime itself works fine, but ssh localhost uptime will stop after the first one, even though it runs the same command on the same machine.

Of course, applying the aforementioned fix, ssh -n or ssh < /dev/null solves the problem and gives the expected result:

16:14:11 up 21 days,  4:24, 12 users,  load average: 0.00, 0.00, 0.00
16:14:11 up 14 days,  6:59, 15 users,  load average: 0.00, 0.00, 0.00
01:14:13 up 73 days, 13:17,  8 users,  load average: 0.08, 0.15, 0.11

If we merely wanted to fix the problem though, we'd just have followed ShellCheck's advice from the start. Let's keep digging.

You see similar behavior if you try to use ffmpeg to convert media or mplayer to play it. However, now it's even worse: not only does it stop after one iteration, it may abort in the middle of the first one!

All other commands work fine -- even other converters, players and ssh-based commands like sox, vlc and scp. Why do certain commands fail?

The root of the problem is that when you pipe or redirect to a while read loop, you're not just redirecting to read but to the entire loop body. Everything in both condition and body will share the same file descriptor for standard input. Consider this loop:

while IFS= read -r line
  cat > rest
done < file.txt

First read will successfully read a line and start the first iteration. Then cat will read from the same input source, where read left off. It reads until EOF and exits, and the loop iterates. read again tries to read from the same input, which remains at EOF. This terminates the loop. In effect, the loop only iterated once.

The question remains, though: why do our three commands in particular drain stdin?

ffmpeg and mplayer are simple cases. They both accept keyboard controls from stdin.

While ffmpeg encodes a video, you can use '+' to make the process more verbose or 'c' to input filter commands. While mplayer plays a video, you can use 'space' to pause or 'm' to mute. The commands drain stdin while processing these keystrokes.

They both also share a shortcut to quit: they will stop abruptly if any of the input they read is a "q".

But why ssh? Shouldn't it mirror the behavior of the remote command? If uptime doesn't read anything, why should ssh localhost uptime?

The Unix process model has no good way to detect when a process wants input. Instead, ssh has to preemptively read data, send it over the wire, and have sshd offer it on a pipe to the process. If the process doesn't want it, there's no way to return the data to the FD from whence it came.

We get a toy version of the same problem with cat | uptime. Output in this case is the same as when using ssh localhost uptime:

 16:25:51 up 21 days,  4:34, 12 users,  load average: 0.16, 0.03, 0.01

In this case, cat will read from stdin and write to the pipe until the pipe's buffer is full, at which time it'll block until something reads. Using strace, we can see that GNU cat from coreutils 8.26 uses a 128KiB buffer -- more than Linux's current 64KiB pipe buffer -- so one 128KiB buffer is the amount of data we can expect to lose.

This implies that the loop doesn't actually abort. It will continue if there is still data left after 128KiB has been read from it. Let's try that:

  echo first
  for ((i=0; i < 16384; i++)); do echo "garbage"; done
  echo "second"
} > file

while IFS= read -r line
  echo "Read $line"
  cat | uptime > /dev/null
done < file

Here, we write 16386 lines to the file. "first", 16384 lines of "garbage", followed by "second". "garbage" + linefeed is 8 bytes, so 16384 of them make up exactly 128KiB. The file prevents any race conditions between producer and consumer.

Here's what we get:

Read first
Read second

If we add a single line additional line of "garbage", we'll see that instead. If we write one less, "second" disappears. In other words, the expected 128KiB of data were lost between iterations.

ssh has to do the same thing, but more: it has to read input, encrypt it, and transmit it over the wire. On the other side, sshd receives it, decrypts it, and feeds it into the pipe. Both sides work asynchronously in duplex mode, and one side can shut down the channel at any time.

If we use ssh localhost uptime we're racing to see how much data we can push before sshd notifies us that the command has already exited. The faster the computer and slower the roundtrip time, the more we can write. To avoid this and ensure deterministic results, we'll use sleep 5 instead of uptime from now on.

Here's one way of measuring how much data we write:

$ tee >(wc -c >&2) < /dev/zero | { sleep 5; }

Of course, by showing how much it writes, it doesn't directly show how much sleep reads: the 65536 bytes here is the Linux pipe buffer size.

This is also not a general way to get exact measurements because it relies on buffers aligning perfectly. If nothing is reading from the pipe, you can successfully write two blocks of 32768 bytes, but only one block of 32769.

Fortunately, GNU tee currently uses a buffer size of 8192, so given 8 full reads, it will perfectly fill the 65536 byte pipe buffer. strace also reveals that ssh (OpenSSH 7.4) uses a buffer size of 16384, which is exactly 2x of tee and 1/4x of the pipe buffer, so they will all align nicely and give an accurate count.

Let's try it with ssh:

$ tee >(wc -c >&2) < /dev/zero | ssh localhost sleep 5

As discussed, we'll subtract the pipe buffer, so we can surmise that 2162688 bytes has been read by ssh. We can verify this manually with strace if we want. But why 2162688?

On the other side, sshd has to feed this data into sleep through a pipe with no readers. That's another 65536. We're now left with 2097152 bytes. How can we account for these?

This number is in fact the OpenSSH transport layer's default window size for non-interactive channels!

Here's an excerpt from channels.h in the OpenSSH source code:

/* default window/packet sizes for tcp/x11-fwd-channel */
#define CHAN_SES_PACKET_DEFAULT	(32*1024)

There it is: 64*32*1024 = 2097152.

If we adapt our previous example to use ssh anyhost sleep 5 and write "garbage"
(64*32*1024+65536)/8 = 270336 times, we can again game the buffers and cause our iterator to get exactly the lines we want:

  echo first
  for ((i=0; i < $(( (64*32*1024 + 65536) / 8)); i++)); do echo "garbage"; done
  echo "second"
} > file

while IFS= read -r line
  echo "Read $line"
  ssh localhost sleep 5
done < file

Again, this results in:

Read first
Read second

An entirely useless experiment of course, but pretty nifty!

Swearing in the Linux kernel: now interactive

Graphs showing a rise in "crap" and fall in "fuck" over time.

If you’ve followed discussions on Linux, you may at some point have bumped into a funny graph showing how many times frustrated Linux kernel developers have put four letter words into the source code.

Today, for the first time in 12 years, it’s gotten a major revamp!

You can now interactively plot any words of your choice with commit level granularity.


Did you find any interesting insights? Post a comment!

Why Bash Is Like That: Rewrite hacks

Bash can seem pretty random and weird at times, but most of what people see as quirks have very logical (if not very good) explanations behind them. This series of posts looks at some of them.

Let’s say you wanted to enforce a policy in which no files on the system could contain swearing. How would you write a script that checks it? Let’s use the word “damn”, and let’s write a script “checklanguage” that checks whether a file contains that word.

Our first version might be:

#!/usr/bin/env bash
grep -q "damn" "$@" 

The problem with this is that it triggers on itself: ./checklanguage checklanguage returns true. How can we write the script in such a way that it reliably detects the word, but doesn’t detect itself? (Think about it for a second).

There are many ways of doing this: a="da"; b="mn"; grep "$a$b", grep "da""mn", grep da\mn. All of these check for the four characters d-a-m-n in sequence, but doesn’t contain the sequence itself. These methods rely on two things being A. identical in one context (shell script) and B. different in another (plaintext).

This type of trick is the basis of three common command line hacks:

Finding processes from ps, while excluding the grep that does the filtering.

If we do a simple ps ax | grep processname, we might get output like this:

$ ps ax | grep processname
13003 pts/2    S      0:00 /bin/bash ./processname
13496 pts/4    R+     0:00 grep --color=auto processname

How do we get the same list, but without the grep process? You’ll see people wrapping the first character in square brackets:

$ ps ax | grep "[p]rocessname"
13003 pts/2    S      0:00 /bin/bash ./processname

In this case, the regex “[p]rocessname” is identical to the regex “processname”, but since they’re written differently, the latter matches itself while the former doesn’t. This means that the grep won’t match itself, and we only get the process we’re interested in (this job is better done by pgrep).

There is no syntax rule that says “if the first character is enclosed in square brackets, grep shall ignore itself in ps output”.

It’s just a logical side effect of rewriting the regex to work the same but not match itself. We could have used grep -E 'process()name' or grep -E 'proces{2}name' instead.

Running commands instead of aliases

Maybe you’re sick of Debian’s weird perl rename, and you aliased it to rename.ul instead.

$ rename -v .htm .html *
`foo.htm' -> `foo.html'

Yay, that’s way easier than writing regex! But what if we need to use the unaliased rename?

$ rename -v 's/([1-9])x([0-9]*)/S$1E$2/' *
rename.ul: not enough arguments

Instead, you’ll see people prefixing the command with a backslash:

$ \rename -v 's/([1-9])x([0-9]*)/S0$1E$2/' *
Foo_1x20.mkv renamed as Foo_S01E20.mkv

Shell aliases trigger when a command starts with a word. However, if the command starts with something that expands into a word, alias expansion does not apply. This allows us to use e.g. \ls or \git to run the command instead of the alias.

There is no syntax rule that says that “if a command is preceded by a backslash, alias expansion is ignored”.

It’s just a logical side effect of rewriting the command to work the same, but not start with a literal token that the shell will recognize as an alias. We could also have used l\s or 'ls'.

Deleting files starting with a dash

How would you go about deleting a file that starts with a dash?

$ rm -v -file
rm: invalid option -- 'l'

Instead, you’ll see people prefixing the filename with ./:

$ rm -v ./-file
removed `./-file'

A command will interpret anything that starts with a dash as a flag. However, to the file system, -file and ./-file mean exactly the same thing.

There is no syntax rule that says that “if an argument starts with ./, it shall be interpretted as a filename and not an option”.

It’s just a logical side effect of rewriting a filename to refer to the same file, but start with a different character. We could have used rm /home/me/-file or rm ../me/-file instead.

Homework: What do you tell someone who thinks that ./myscript is a perfect example of how weird UNIX is? Why would anyone design a system where the run command is “./” instead of “run”?

Basics of a Bash action game

If you want to write an action game in bash, you need the ability to check for user input without actually waiting for it. While bash doesn’t let you poll the keyboard in a great way, it does let you wait for input for a miniscule amount of time with read -t 0.0001.

Here’s a snippet that demonstrates this by bouncing some text back and forth, and letting the user control position and color. It also sets (and unsets) the necessary terminal settings for this to look good:

#!/usr/bin/env bash

# Reset terminal on exit
trap 'tput cnorm; tput sgr0; clear' EXIT

# invisible cursor, no echo
tput civis
stty -echo

text="j/k to move, space to color"
max_x=$(($(tput cols) - ${#text}))
dir=1 x=1 y=$(($(tput lines)/2))

while sleep 0.05 # GNU specific!
    # move and change direction when hitting walls
    (( x == 0 || x == max_x )) && \
        ((dir *= -1))
    (( x += dir ))

    # read all the characters that have been buffered up
    while IFS= read -rs -t 0.0001 -n 1 key
        [[ $key == j ]] && (( y++ ))
        [[ $key == k ]] && (( y-- ))
        [[ $key == " " ]] && color=$((color%7+1))

    # batch up all terminal output for smoother action
        tput cup "$y" "$x"
        tput setaf "$color"
        printf "%s" "$text"

    # dump to screen
    printf "%s" "$framebuffer"

Making bash run DOS/Windows CRLF EOL scripts

If you for any reason use a Windows editor to write scripts, it can be annoying to remember to convert them and bash fails in mysterious ways when you don’t. Let’s just get rid of that problem once and for all:

cat > $'/bin/bash\r' << "EOF"
#!/usr/bin/env bash
exec bash <(tr -d '\r' < "$script") "$@"

This allows you to execute scripts with DOS/Windows \r\n line endings with ./yourscript (but it will fail if the script specifies parameters on the shebang, or if you run it with bash yourscript). It works because from a UNIX point of view, DOS/Windows files specify the interpretter as "bash^M", and we override that to clean the script and run bash on the result.

Of course, you can also replace the helpful exec bash part with echo "Run dos2unix on your file!" >&2 if you'd rather give your users a helpful reminder rather than compatibility or a crazy error.