A shell script that deleted a database, and how ShellCheck could have helped

Summary: We examine a real world case of how an innocent shell scripting mistake caused the deletion of a production database, and how ShellCheck (a GPLv3 shell script linting and analysis tool) would have pointed out the errors and prevented the disaster.

Disclosure: I am the ShellCheck author.

The event

Here is the sad case, taken from a recent StackOverflow post:

My developer committed a huge mistake and we cannot find our mongo database anymore in the server. Rescue please!!!

He logged into the server, and saved the following shell under ~/crontab/mongod_back.sh:

#!/bin/sh
DUMP=mongodump
OUT_DIR=/data/backup/mongod/tmp     // temporary backup directory
TAR_DIR=/data/backup/mongod         // permanent backup directory
DATE=`date +%Y_%m_%d_%H_%M_%S`      // backups are named by timestamp
DB_USER=Guitang                     // database user
DB_PASS=qq____________              // database password
DAYS=14                             // keep the latest 14 days of backups
TAR_BAK="mongod_bak_$DATE.tar.gz"   // backup file name format
cd $OUT_DIR                         // create the folder
rm -rf $OUT_DIR/*                   // empty the temporary directory
mkdir -p $OUT_DIR/$DATE             // create this backup's folder
$DUMP -d wecard -u $DB_USER -p $DB_PASS -o $OUT_DIR/$DATE  // run the backup command
tar -zcvf $TAR_DIR/$TAR_BAK $OUT_DIR/$DATE       // pack the backup into the permanent dir
find $TAR_DIR/ -mtime +%DAYS -delete             // delete backups older than 14 days

And then he ran ./mongod_back.sh; there were lots of permission denied errors, so he hit Ctrl+C. Then the server shut down automatically.

He then contacted AliCloud, and the engineer connected the disk to another working server so that he could check it. Then he realized that some folders were gone, including /data/ where the mongodb is!!!

PS: he did not take snapshot of the disk before.

Essentially, it’s every engineer’s nightmare.

The post-mortem of this issue is an interesting puzzle that requires only basic shell scripting knowledge. If you’d like to give it a try, now’s the time. If you’d like some hints, here’s shellcheck’s output for the script.

The rest of this post details what happened, and how ShellCheck could have averted the disaster.

What went wrong?

The MCVE for how to ruin your week is this:

#!/bin/sh
DIR=/data/tmp    // The directory to delete
rm -rf $DIR/*    // Now delete it

The fatal error here is that // is not a comment in shell scripts. It’s a path to the root directory, equivalent to /.

On some platforms, the rm line would have been fatal by itself, because it’d boil down to rm -rf / with a few other arguments. Implementations these days often don’t allow this, though. The disaster in question happened on Ubuntu, whose GNU rm would have refused:

$ rm -rf //
rm: it is dangerous to operate recursively on '//' (same as '/')
rm: use --no-preserve-root to override this failsafe

This is where the assignment comes in.

The shell treats variable assignments and commands as two sides of the same coin. Here’s the description from POSIX:

A “simple command” is a sequence of optional variable assignments and redirections, in any sequence, optionally followed by words and redirections, terminated by a control operator.

(A “simple command” is in contrast to a “compound” command, which are structures like if statements and for loops that contain one or more simple or compound commands.)

This means that var=42 and echo "Hello" are both simple commands. The former has one optional assignment and zero optional words. The latter has zero optional assignments and two optional words.

It also implies that a single simple command can contain both: var=42 echo "Hello"

To make a long spec short, assignments in a simple command will apply only to the invoked command name. If there is no command name, they apply to the current shell. This latter explains var=42 by itself, but when would you use the former?

It’s useful when you want to set a variable for a single command without affecting the rest of your shell:

$ echo "$PAGER"  # Show current pager
less

$ PAGER="head -n 5" man ascii
ASCII(7)       Linux Programmer's Manual      ASCII(7)

NAME
       ascii  -  ASCII character set encoded in octal,
       decimal, and hexadecimal

$ echo "$PAGER"  # Current pager hasn't changed
less

This is exactly what happened unintentionally in the fatal assignment. Just like how the previous example scoped PAGER to man only, this one scoped DIR to //:

$ DIR=/data/tmp    // The directory to delete
bash: //: Is a directory

$ echo "$DIR"  # The variable is unset
(no output)

This meant that rm -rf $DIR/* became rm -rf /*, and therefore bypassed the check that was in place for rm -rf /.

(Why can’t or won’t rm simply refuse to delete /* too? Because it never sees /*: the shell expands it first, so rm sees /bin /boot /dev /data .... While rm could obviously refuse to remove first level directories as well, this starts getting in the way of legitimate usage – a big sin in the Unix philosophy)
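
A harmless way to see exactly what rm ended up receiving is to stick echo in front of the command, so that the shell’s expansion gets printed instead of executed. Here’s a hypothetical re-run of the fatal line with the variable left empty (the exact listing will of course differ from system to system):

$ DIR=""              # simulate the assignment that never took effect
$ echo rm -rf $DIR/*
rm -rf /bin /boot /data /dev /etc /home /lib /root /usr /var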

How ShellCheck could have helped

Here’s the output from this minimized snippet (see online):

$ shellcheck myscript

In myscript line 2:
DIR=/data/tmp    // The directory to delete
                 ^-- SC1127: Was this intended as a comment? Use # in sh.


In myscript line 3:
rm -rf $DIR/*    // Now delete it
       ^----^ SC2115: Use "${var:?}" to ensure this never expands to /* .
       ^--^ SC2086: Double quote to prevent globbing and word splitting.
                 ^-- SC2114: Warning: deletes a system directory.

Two issues have already been discussed, and would have averted this disaster:

  • ShellCheck noticed that the first // was likely intended as a comment (wiki: SC1127).
  • ShellCheck pointed out that the second // would target a system directory (wiki: SC2114).

The third is a general defensive technique which would also have prevented this catastrophic rm independently of the two other fixes:

  • ShellCheck suggested using rm -rf ${DIR:?}/* to abort execution if the variable for any reason is empty or unset (wiki: SC2115).

This would mitigate the effect of a whole slew of pitfalls that can leave a variable empty, including echo /tmp | read DIR (subshells), DIR= /tmp (bad spacing) and DIR=$(echo /tmp) (potential fork/command failures).
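
As a quick sketch of how that guard behaves: with an empty variable, the ${var:?} expansion makes the shell print an error and abort instead of running the command (the exact message varies between shells, but the effect is the same):

$ DIR=""
$ rm -rf "${DIR:?}"/*
bash: DIR: parameter null or not set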

Conclusion

Shell scripts are really convenient, but also have a large number of potential pitfalls. Many issues that would be simple, fail-fast syntax errors in other languages would instead cause a script to misbehave in confusing, annoying, or catastrophic ways. Many examples can be found in the Wooledge Bash Pitfalls list, or ShellCheck’s own gallery of bad code.

Since tooling exists, why not take advantage? Even if (or especially when!) you rarely write shell scripts, you can install shellcheck from your package manager, along with a suitable editor plugin like Flycheck (Emacs) or Syntastic (Vim), and just forget about it.

The next time you’re writing a script, your editor will show warnings and suggestions automatically. Whether or not you want to fix the more pedantic style issues, it may be worth looking at any unexpected errors and warnings. It might just save your database.

An ode to pack: gzip’s forgotten decompressor

The latest 4.13.9 source release of the Linux kernel is 780 MiB, but thanks to xz compression, the download is a much more manageable 96 MiB (an 88% reduction).

Before xz took over as the default compression format on kernel.org in 2013, following the “latest” link would have gotten you a bzip2 compressed file. The tar.bz2 would have been 115 MiB (-85%), but there was no defending the extra 20 MiB after xz caught up in popularity. bzip2 is all but displaced today.

bzip2 became the default in 2003, though it had long been an option over the less efficient gzip. However, since every OS, browser, language core library, phone and IoT lightswitch has built-in support for gzip, a 148 MiB (-81%) tar.gz remains an option even today.

gzip itself started taking over in 1994, before kernel.org, and before the World Wide Web went mainstream. It must have been a particularly easy sell for the fledgling Linux kernel: it was made, used and endorsed by the mighty GNU project, it was Free Software, free of patent restrictions, and it provided powerful .zip style DEFLATE compression in a Unix friendly package.

Another nice benefit was that gzip could decompress other contemporary formats, thereby replacing contested and proprietary software.

Among the tools it could replace was compress, the de-facto Unix standard at the time. Created based on LZW in 1985, it was hampered by the same patent woes that plagued GIF files. The then-ubiquitous .Z suffix graced the first public Linux releases, but is now recognized only by the most long-bearded enthusiasts. The current release would have been 302 MiB (-61%) with compress.

Another even more obscure tool it could replace was compress’s own predecessor, pack. This rather loosely defined collection of only partially compatible formats is why compress had to use a capital Z in its extension. pack came first, and offered straight Huffman coding with a .z extension.

With pack, our Linux release would have been 548 MiB (-30%). Compared to xz’s 96 MiB, it’s obvious why no one has used it for decades.

Well, guess what: gzip never ended its support! Quoth the man page,

gunzip can currently decompress files created by gzip, zip,
compress, compress -H or pack.

While multiple implementations existed, they shared some common peculiarities:

  • They could not be used in pipes.
  • They could not represent empty files.
  • They could not compress a file with only one byte value, e.g. "aaaaaa…"
  • They could fail on "large" files: "can’t occur unless [file size] >= [16MB]", a comment said dismissively, written at a time when a 10MB hard drive was a luxury few could afford.

These issues stemmed directly from the Huffman coding used. Huffman coding, developed in 1952, is basically an improvement on Morse code, where common characters like "e" get a short code like "011", while uncommon "z" gets a longer one like "111010".

  • Since you have to count the characters to figure out which are common, you can not compress in a single pass in a pipe. Now that memory is cheap, you could mostly get around that by keeping the data in RAM.

  • Empty files and single-valued files hit an edge case: if you only have a single value, the shortest code for it is the empty string. Decompressors that didn’t account for it would get stuck reading 0 bits forever. You can get around it by adding unused dummy symbols to ensure a minimum bit length of 1.

  • A file over 16MB could cause a single character to be so rare that its bit code was 25+ bits long. A decompressor storing the bits to be decoded in a 32bit value (a trick even gzip uses) would be unable to append a new 8bit byte to the buffer without displacing part of the current bit code. You can get around that by using "package merge" length restricted prefix codes over naive Huffman codes.

I wrote a Haskell implementation with all these fixes in place: koalaman/pack is available on GitHub.

During development, I found that pack support in gzip had been buggy since 2012 (version 1.6), but no one had noticed in the five years since. I tracked down the problem and I’m happy to say that version 1.9 will again restore full pack support!

Anyways, what could possibly be the point of using pack today?

There is actually one modern use case: code golfing.

This post came about because I was trying to implement the shortest possible program that would output a piece of simple ASCII art. A common trick is variations of a self-extracting shell script:

sed 1d $0|gunzip;exit
<compressed binary data here>
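
Creating such a file is a one-liner. Here’s a sketch, where art.txt is just a stand-in for whatever you want to embed:

$ { echo 'sed 1d $0|gunzip;exit'; gzip -9 < art.txt; } > art.sh
$ sh art.sh
<contents of art.txt>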

You can use any available compressor, including xz and bzip2, but these were meant for bigger files and have game-ruining overheads. Here’s the result of compressing the ASCII art in question:

  • raw: 269 bytes
  • xz: 216 bytes
  • bzip2: 183 bytes
  • gzip: 163 bytes
  • compress: 165 bytes
  • and finally, pack: 148 bytes!

I was able to save 15 bytes by leveraging gzip’s forgotten legacy support. This is huge in a sport where winning entries are bytes apart.

Let’s have a look at this simple file format. Here’s an example pack file header for the word "banana":

1f 1e        -- Two byte magic header
00 00 00 06  -- Original uncompressed length (6 bytes)

Next comes the Huffman tree. Building it is simple to do by hand, but too much for this post. It just needs to be complete, left-aligned, with eof on the right at the deepest level. Here’s the optimal tree for this string:

        /\
       /  a
      /\
     /  \
    /\   n
   b  eof

We start by encoding its depth (3), and the number of leaves on each level. The last level is encoded minus 2, because the lowest level will have between 2 and 257 leaves, while a byte can only store 0-255.

03  -- depth
01  -- level 1 only contains 'a'
01  -- level 2 only contains 'n'
00  -- level 3 contains 'b' and 'eof', -2 as mentioned

Next we encode the ASCII values of the leaves in the order from top to bottom, left to right. We can leave off the EOF (which is why it needs to be in the lower right):

61 6e 62  -- "a", "n", "b"

This is enough for the decompressor to rebuild the tree. Now we go on to encode the actual data.

Starting from the root, the Huffman codes are determined by adding a 0 for every left branch and a 1 for every right branch you have to take to get to your value:

a   -> right = 1
n   -> left+right = 01
b   -> left+left+left -> 000
eof -> left+left+right -> 001

banana<eof> would therefore be 000 1 01 1 01 1 001, or when grouped as bytes:

16  -- 0001 0110
C8  -- 1100 1   (000 as padding)

And that’s all we need:

$ printf '\x1f\x1e\x00\x00\x00\x06'\
'\x03\x01\x01\x00\x61\x6e\x62\x16\xc8' | gzip -d
banana

Unfortunately, the mentioned gzip bug triggers due to failing to account for leading zeroes in the bit code. eof and a have the codes 001 and 1 respectively, so an oversimplified equality check confuses one for the other, causing gzip to terminate early:

b
gzip: stdin: invalid compressed data--length error

However, if you’re stuck with an affected version, there’s another party trick you can do: the Huffman tree has to be canonical, but it does not have to be optimal!

What would happen if we skipped the count and instead claimed that each ASCII character is equally likely? Why, we’d get a tree of depth 8 where all the leaf nodes are on the deepest level.

It then follows that each 8 bit character will be encoded as 8 bits in the output file, with the bit patterns we choose by ordering the leaves.

Let’s add a header with a dummy length to a file:

$ printf '\x1F\x1E____' > myfile.z

Now let’s append the aforementioned tree structure, 8 levels with all nodes in the last one:

$ printf '\x08\0\0\0\0\0\0\0\xFE' >> myfile.z

And let’s populate the leaf nodes with 255 bytes in an order of our choice:

$ printf "$(printf '\\%o' {0..254})" |
    tr 'A-Za-z' 'N-ZA-Mn-za-m' >> myfile.z

Now we can run the following command, enter some text, and hit Ctrl-D to "decompress" it:

$ cat myfile.z - | gzip -d 2> /dev/null
Jr unir whfg pbaivaprq TMvc gb hafpenzoyr EBG13!
<Ctrl+D>
We have just convinced GZip to unscramble ROT13!

Can you think of any other fun ways to use or abuse gzip’s legacy support? Post a comment.

[ -z $var ] works unreasonably well

There is a subreddit /r/nononoyes for videos of things that look like they’ll go horribly wrong, but amazingly turn out ok.

[ -z $var ] would belong there.

It’s a bash statement that tries to check whether the variable is empty, but it’s missing quotes. Most of the time, when dealing with variables that can be empty, this is a disaster.

Consider its opposite, [ -n $var ], for checking whether the variable is non-empty. With the same quoting bug, it becomes completely unusable:

Input        Expected    [ -n $var ]
“”           False       True!
“foo”        True        True
“foo bar”    True        False!

These issues are due to a combination of word splitting and the fact that [ is not shell syntax but traditionally just an external binary with a funny name. See my previous post Why Bash is like that: Pseudo-syntax for more on that.

The evaluation of [ is defined in terms of the number of arguments. The argument values have much less to do with it. Ignoring negation, here’s a simplified excerpt from POSIX test:

# Arguments   Action                                   Typical example
0             False                                    [ ]
1             True if $1 is non-empty                  [ "$var" ]
2             Apply unary operator $1 to $2            [ -x "/bin/ls" ]
3             Apply binary operator $2 to $1 and $3    [ 1 -lt 2 ]

Now we can see why [ -n $var ] fails in two cases:

When the variable is empty and unquoted, it’s removed, and we pass 1 argument: the literal string “-n”. Since “-n” is not an empty string, it evaluates to true when it should be false.

When the variable contains foo bar and is unquoted, it’s split into two arguments, and so we pass 3: “-n”, “foo” and “bar”. Since “foo” is not a binary operator, it evaluates to false (with an error message) when it should be true.
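
Both failure modes are easy to reproduce interactively (the exact error text may differ between shells and versions):

$ var=""
$ [ -n $var ] && echo "non-empty" || echo "empty"
non-empty
$ var="foo bar"
$ [ -n $var ] && echo "non-empty" || echo "empty"
bash: [: foo: binary operator expected
empty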

Now let’s have a look at [ -z $var ]:

Input        Expected           [ -z $var ]      Actual test
“”           True: is empty     True             1 arg: is “-z” non-empty?
“foo”        False: not empty   False            2 args: apply -z to “foo”
“foo bar”    False: not empty   False (error)    3 args: apply “foo” to “-z” and “bar”

It performs a completely wrong and unexpected action for both empty strings and multiple arguments. However, both cases fail in exactly the right way!

In other words, [ -z $var ] works way better than it has any possible business doing.

This is not to say you can skip quoting, of course. For “foo bar”, [ -z $var ] in bash will return the correct exit code, but prints an ugly error in the process. For “ ” (a string containing only spaces), it returns true when it should be false, because the argument is removed as if it were empty. Bash will also incorrectly pass var="foo -o x", because it ends up being a valid test through code injection.
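
For example, both of these variables are clearly non-empty, yet the unquoted test claims otherwise:

$ var=" "
$ [ -z $var ] && echo "looks empty"
looks empty
$ var="foo -o x"
$ [ -z $var ] && echo "looks empty"
looks empty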

The moral of the story? Same as always: quote, quote, quote. Even when things appear to work.

ShellCheck is aware of this difference, and you can check the code used here online. [ -n $var ] gets an angry red message, while [ -z $var ] merely gets a generic green quoting warning.

Swearing in the Linux kernel: now interactive

Graphs showing a rise in "crap" and fall in "fuck" over time.
 

If you’ve followed discussions on Linux, you may at some point have bumped into a funny graph showing how many times frustrated Linux kernel developers have put four letter words into the source code.

Today, for the first time in 12 years, it’s gotten a major revamp!

You can now interactively plot any words of your choice with commit level granularity.

 

Did you find any interesting insights? Post a comment!

Useless Use Of dd

tl;dr: dd works for reading and writing disks, but it has no advantages and some disadvantages over just treating the disk as a regular file. In many commands it’s entirely pointless.

If you’ve ever used dd, you’ve probably used it to read or write disk images:

# Write myfile.iso to a USB drive
dd if=myfile.iso of=/dev/sdb bs=1M

Usage of dd in this context is so pervasive that it’s being hailed as the magic gatekeeper of raw devices. Want to read from a raw device? Use dd. Want to write to a raw device? Use dd.

This belief adds complexity to simple commands. How do you combine dd with gzip? How do you use pv if the source is a raw device? How do you dd over ssh?

People cleverly find ways to insert dd at the front and end of pipelines. dd if=/dev/sda | gzip > image.gz, they say. dd if=/dev/sda | pv | dd of=/dev/sdb.

In both these cases, dd serves no purpose. It’s purely a superstitious charm trying to ensure safe passage of the data. It’s like cat /dev/sda | pv | cat > /dev/sdb except not as efficient.

The fact of the matter is, dd is not a disk writing tool. Neither “d” stands for “disk”, “drive” or “device”. It does not support “low level” reading or writing. It has no special dominion over any kind of device whatsoever.

dd just reads and writes files.

On UNIX, the adage goes, everything is a file. This includes raw disks. Since raw disks are files, and dd can be used to copy files, dd can be used to copy raw disks.

But do you know what else can read and write files? Everything:

# Write myfile.iso to a USB drive
cp myfile.iso /dev/sdb

# Rip a cdrom to a .iso file
cat /dev/cdrom > myfile.iso

# Create a gzipped image
gzip -9 < /dev/sdb > /tmp/myimage.gz

dd can even end up doing a worse job. By specification, its default block size of 512 bytes has had to remain unchanged for decades. Today, this tiny size makes it CPU bound by default. A script that doesn’t specify a block size is needlessly inefficient, and any script that hardcodes the currently optimal value may slowly become obsolete, or start out obsolete if it was copied from an older source.
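
You can see the overhead for yourself by timing a throwaway copy with dd’s default block size against a larger one (the numbers will vary wildly between machines; only the relative difference matters):

$ time dd if=/dev/zero of=/dev/null bs=512 count=2097152   # 1 GiB in 512 byte blocks (the default size)
$ time dd if=/dev/zero of=/dev/null bs=1M count=1024       # the same 1 GiB in 1 MiB blocks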

Meanwhile, cat is free to choose whatever buffer size best serves a modern system, and the GNU cat buffer size has grown steadily over the years, from 512 bytes in 1991 to 131072 bytes in 2014. ./src/ioblksize.h in the coreutils source code has benchmarks backing up this decision.

However, this does not mean that dd should necessarily be categorically shunned! The reason why people started using it in the first place is that it does exactly what it’s told: no more and no less.

If an alias specifies -a, cp might try to create a new block device rather than a copy of the file data. If using gzip without redirection, it may try to be helpful and skip the file for not being regular. Neither of them will write out a reassuring status during or after a copy.

dd, meanwhile, has one job*: copy data from one place to another. It doesn’t care about files, safeguards or user convenience. It will not try to second guess your intent, based on trailing slashes or types of files.

However, when this is no longer a convenience, like when combining it with other tools that already read and write files, one should not feel guilty for leaving dd out entirely.

This is not to say I think dd is overrated! Au contraire! It’s one of my favorite Unix tools!

dd is the swiss army knife of the open, read, write and seek syscalls. It’s unique in its ability to issue seeks and reads of specific lengths, which enables a whole world of shell scripts that have no business being shell scripts. Want to simulate a lseek+execve? Use dd! Want to open a file with O_SYNC? Use dd! Want to read groups of three byte pixels from a PPM file? Use dd!
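
As a sketch of that kind of use, here dd reads and rewrites specific byte ranges in the middle of a file, something cp and cat have no flags for (the file name and offsets are made up for illustration):

# Read 16 bytes starting at offset 1024, leaving the rest of the file alone
$ dd if=myfile.bin bs=1 skip=1024 count=16 2>/dev/null | od -An -tx1

# Overwrite 4 bytes at offset 512 in place, without truncating the file
$ printf 'ABCD' | dd of=myfile.bin bs=1 seek=512 conv=notrunc 2>/dev/null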

It’s a flexible, unique and useful tool, and I love it. My only issue is that, far too often, this great tool is being relegated to and inappropriately hailed for its most generic and least interesting capability: simply copying a file from start to finish.


* dd actually has two jobs: Convert and Copy. A post on comp.unix.misc (incorrectly) claimed that the intended name “cc” was taken by the C compiler, so the letters were shifted in the same way we ended up with a window system called X. A more likely explanation is given in that thread, as pointed out by Paweł and Bruce in the comments: the name, syntax and purpose are almost identical to the JCL “Dataset Definition” command found on 1960s IBM mainframes.

I’m not paranoid, you’re just foolish

Remember this dialog from when you installed your distro?

Fake dialog saying "In the event of physical theft, grant perpetrators access to" with options for "My browsing history, My email and social media, My photos and documents, and similar". All boxes are checked by default.

Most distros have a step like this. If you don’t immediately recognize it, you might have used a different installer with different wording. For example, the graphical Ubuntu installer calls it “Encrypt the new Ubuntu installation for security”, while the text installer even more opaquely calls it “Use entire disk and set up encrypted LVM”.

Somehow, some people have gotten it into their heads that not granting the new owner access to all your data after they steal your computer is a sign of paranoia. It’s 2015, and there actually exist people who have information-based jobs and spend half their lives online, who not only think disk encryption is unnecessary, but that it’s a sign you’re doing something illegal.

I have no idea what kind of poorly written crime dramas or irrational prime ministers they get this ridiculous notion from.

An office desk covered in broken glass after a break-in.
The last time my laptop was stolen from a locked office building.

Here’s a photo from 2012, when my company laptop was taken from a locked office with an alarm system.

Was this the inevitable FBI raid I was expecting and encrypted my drive to thwart?

Or was it a junkie stealing an office computer from a company and user who, thanks to encryption, didn’t have to worry about online accounts, design documents, or the source code for their unreleased product?

Hundreds of thousands of computers are lost or stolen every year. I’m not paranoid for using disk encryption, you’re just foolish if you don’t.