The curious pitfalls in shell redirections to $((i++))

ShellCheck v0.7.1 has just been released. It primarily has cleanups and bugfixes for existing checks, but also some new ones. The new check I personally find the most fascinating is this one, for an issue I haven’t really seen discussed anywhere before:

In demo line 6:
  cat template/header.txt "$f" > archive/$((i++)).txt
                                             ^
  SC2257: Arithmetic modifications in command redirections
          may be discarded. Do them separately.

Here’s the script in full:

#!/bin/bash
i=1
for f in *.txt
do
  echo "Archiving $f as $i.txt"
  cat template/header.txt "$f" > archive/$((i++)).txt
done

Seasoned shell scripters may already have jumped ahead, tried it in their shell, and found that the change is not discarded, at least not in their Bash 5.0.16(1):

bash-5.0$ i=0; echo foo > $((i++)).txt; echo "$i" 
1

Based on this, you may be expecting a quick look through the Bash commit history, and maybe a plea that we should be kind to our destitute brethren on macOS with Bash 3.

But no. Here’s the demo script on the same system:

bash-5.0$ ./demo
Archiving chocolate_cake_recipe.txt as 1.txt
Archiving emo_poems.txt as 1.txt
Archiving project_ideas.txt as 1.txt

The same is true for source ./demo, which runs the script in the exact same shell instance that we just tested on. Furthermore, it only happens in redirections, and not in arguments.

So what’s going on?

As it turns out, Bash, Ksh and BusyBox ash all expand the redirection filename as part of setting up file descriptors. If you are familiar with the Unix process model, the pseudocode would be something like this:

if command is external:
  fork child process:
    filename := expandString(command.stdout) # Increments i
    fd[1] := open(filename)
    execve(command.executable, command.args)
else:
  filename := expandString(command.stdout)   # Increments i
  tmpFd := open(filename)
  run_internal_command(command, stdout=tmpFd)
  close(tmpFd)

In other words, the scope of the variable modification depends on whether the shell forked off a new process in anticipation of executing the command.

For shell builtin commands that don’t or can’t fork, like echo, this means that the change takes effect in the current shell. This is the test we did.

For external commands, like cat, the change is only visible between the time the file descriptor is set up until the command is invoked to take over the process. This is what the demo script does.
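
A small experiment makes the three cases visible. This is only a sketch: the scratch directory and variable names are made up for the demonstration, it assumes cat and env are external commands on your system, and it uses $((i=i+1)) rather than $((i++)) so that it also runs in shells without ++.

```shell
#!/bin/sh
# Sketch: where does an arithmetic side effect end up?
# Assumes cat and env are external commands (true for Bash on typical systems).
tmp=$(mktemp -d) || exit 1
i=0

# 1. Builtin, side effect in the redirection: expanded in the current shell.
echo foo > "$tmp/$((i=i+1)).txt"
after_builtin=$i

# 2. External command, side effect in an argument: arguments are expanded
#    before the shell forks, so the change is kept.
env true "$((i=i+1))"
after_argument=$i

# 3. External command, side effect in the redirection: Bash expands the
#    filename in the forked child, so the parent's i is left unchanged.
cat /dev/null > "$tmp/$((i=i+1)).txt"
after_redirection=$i

echo "builtin: $after_builtin, argument: $after_argument, redirection: $after_redirection"
rm -rf -- "${tmp:?}"
```

In Bash, the last two numbers should come out the same. A shell that expands the redirection in the current process would count one higher for the third case.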

Of course, subshells are well known to experienced scripters, and also described on this blog in the article Why Bash is like that: Subshells, but to me, this is a new and especially tricky source of them.

For example, the script works fine in busybox sh, where cat is a builtin:

$ busybox sh demo
Archiving chocolate_cake_recipe.txt as 1.txt
Archiving emo_poems.txt as 2.txt
Archiving project_ideas.txt as 3.txt

Similarly, the scope may depend on whether you overrode any commands with a wrapper function:

awk() { gawk "$@"; }
# Increments
awk 'BEGIN {print "hi"; exit;}' > $((i++)).txt
# Does not increment
gawk 'BEGIN {print "hi"; exit;}' > $((i++)).txt  
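
The same effect can be reproduced without gawk. In this sketch, my_cat is a hypothetical wrapper function and cat is assumed to be an external command:

```shell
#!/bin/sh
# Sketch: a wrapper function makes the redirection happen in the current shell.
# my_cat is a hypothetical wrapper; cat is assumed to be an external command.
tmp=$(mktemp -d) || exit 1
my_cat() { cat "$@"; }

i=0
my_cat /dev/null > "$tmp/$((i=i+1)).txt"   # function: redirection set up in this shell
after_function=$i

cat /dev/null > "$tmp/$((i=i+1)).txt"      # external: Bash expands this in the child
after_external=$i

echo "function: $after_function, external: $after_external"
rm -rf -- "${tmp:?}"
```

In Bash both numbers should be 1, since only the call through the function changes i.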

Or if you want to override an alias, the result depends on whether you used command or a leading backslash:

# Increments
command git show . > $((i++)).txt
# Does not increment
\git show . > $((i++)).txt

To avoid this confusion, consider following ShellCheck’s advice and just increment the variable separately if it’s part of the filename in a redirection:

anything > "$i.txt"
: $((i++))
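
Applied to the demo script, the fix could look like the sketch below. The scratch directory and the three sample files are invented so that the example is self-contained. The only real change is that the counter is updated on a line of its own.

```shell
#!/bin/sh
# Sketch of the demo script with the increment done separately.
# The scratch directory and the three sample files are made up for the demo.
work=$(mktemp -d) || exit 1
mkdir "$work/template" "$work/archive"
echo "header" > "$work/template/header.txt"
for name in chocolate_cake_recipe emo_poems project_ideas; do
  echo "$name" > "$work/$name.txt"
done

i=1
for f in "$work"/*.txt
do
  echo "Archiving ${f##*/} as $i.txt"
  cat "$work/template/header.txt" "$f" > "$work/archive/$i.txt"
  i=$((i + 1))   # plain assignment in the current shell, never in a forked child
done

# Count the archived files without parsing ls output.
set -- "$work"/archive/*
archived=$#
echo "Archived $archived files"
rm -rf -- "${work:?}"
```

Because the redirection target no longer has side effects, it makes no difference whether cat is a builtin, a function, or an external command.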

Thanks to Strolls on #bash@Freenode for pointing out this behavior.

PS: While researching this article, I found that dash always increments (though with $((i=i+1)) since it doesn’t support ++). ShellCheck v0.7.1 still warns, but git master does not.

Lessons learned from writing ShellCheck, GitHub’s now most starred Haskell project

ShellCheck is a static analysis tool that points out common problems and pitfalls in shell scripts.

As of last weekend it appears to have become GitHub’s most starred Haskell repository, after a mention in MIT SIPB’s Writing Safe Shell Scripts guide.

While obviously a frivolous metric in a niche category, I like to interpret this as meaning that people are finding ShellCheck as useful as I find Pandoc, the excellent universal document converter I use for notes, blog posts and ShellCheck’s man page, and which held a firm grip on the top spot for a long time.

I am very happy and humbled that so many people are finding the project helpful and useful. The response has been incredibly and overwhelmingly positive. Several times per week I see mentions from people who tried it out, and it either solved their immediate problem, or it taught them something new and interesting they didn’t know before.

I started the project 8 years ago, and this seems like a good opportunity to share some of the lessons learned along the way.

Quick History

ShellCheck is generally considered a shell script linter, but it actually started life in 2012 as an IRC bot (of all things!) on #bash@Freenode. It’s still there and as active as ever.

The channel is the home of the comprehensive and frequently cited Wooledge BashFAQ, plus an additional list of common pitfalls. Between them, they currently cover 178 common questions about Bash and POSIX sh.

Since no one ever reads the FAQ, an existing bot allowed regulars to e.g. answer any problem regarding variations of for file in `ls` with a simple !pf 1, and let a bot point the person in the right direction (the IRC equivalent of StackOverflow’s "duplicate of").

ShellCheck’s original purpose was essentially to find out how many of these FAQs could be classified automatically, without any human input.

Due to this, ShellCheck was designed for different goals than most linters.

  1. It would only run on buggy scripts, because otherwise they wouldn’t have been posted.
  2. It would only run once, and should be as helpful as possible on the first pass.
  3. It would run on my machine, not on arbitrary user’s systems.

This will become relevant.

On Haskell

Since ShellCheck was a hobby project that wasn’t intended to run on random people’s machines, I could completely ignore popularity, familiarity, and practicality, and pick the language that was the most fun and interesting.

That was, of course, Haskell.

As anyone who looks at the code will quickly conclude, ShellCheck was my first real project in the language.

Some things worked really well:

  • QuickCheck has been beyond amazing. ShellCheck has 1500 unit tests just because they’re incredibly quick and convenient to write. It’s so good that I’m adding a subsection for it.
  • Parsec is a joy to write parsers in. Initially I dreaded e.g. implementing backticks because they require recursively re-invoking the parser on escaped string data, but every time I faced such issues, they turned out to be much easier than expected.
  • Haskell itself is a very comfy, productive language to write. It’s not at all arcane or restrictive as first impressions might have you believe. I’d take it over Java or C++ for most things.
  • Haskell is surprisingly portable. I was shocked when I discovered that people were running ShellCheck natively on Windows without problems. ARM required a few code changes at the time, but wouldn’t have today.

Some things didn’t work as well:

  • Haskell has an undeniably high barrier to entry for the uninitiated, and ShellCheck’s target audience is not Haskell developers. I think this has seriously limited the scope and number of contributions.
  • It’s easier to write than to run: it’s been hard to predict and control runtime performance. For example, many of ShellCheck’s check functions take an explicit "params" argument. Converting them to a cleaner ReaderT led to a 10% total run time regression, so I had to revert it. It makes me wonder about the speed penalty of code I designed better to begin with.
  • Controlling memory usage is also hard. I dropped multithreading support because I simply couldn’t figure out the space leaks.
  • For people not already invested in the ecosystem, the runtime dependencies can be 100MB+. ShellCheck is available as a standalone ~8MB executable, which took some work and is still comparatively large.
  • The Haskell ecosystem moves and breaks very quickly. New changes would frequently break on older platform versions. Testing the default platform version of mainstream distros in VMs was slow and tedious. Fortunately, Docker came along to make it easy to automate per-distro testing, and Stack brought reproducible Haskell builds.

If starting a new developer tooling project for a mainstream audience, I might choose a more mainstream language. I’d also put serious consideration into how well the language runs on a JSVM, since (love it or hate it) this would solve a lot of distribution, integration, and portability issues.

ShellCheck’s API is not very cohesive. If starting a new project in Haskell today, I would start out by implementing a dozen business logic functions in every part of the system in my best case pseudocode. This would help me figure out the kind of DSL I want, and help me develop a more uniform API on a suitable stack of monads.

Unit testing made fun and easy

ShellCheck is ~10k LoC, but has an additional 1.5k unit tests. I’m not a unit testing evangelist, 100% completionist or TDD fanatic: this simply happened by itself because writing tests was so quick and easy. Here’s an example check:

prop_checkSourceArgs1 = verify checkSourceArgs "#!/bin/sh\n. script arg"
prop_checkSourceArgs2 = verifyNot checkSourceArgs "#!/bin/sh\n. script"
prop_checkSourceArgs3 = verifyNot checkSourceArgs "#!/bin/bash\n. script arg"
checkSourceArgs = CommandCheck (Exactly ".") f
  where
    f t = whenShell [Sh, Dash] $
        case arguments t of
            (file:arg1:_) -> warn (getId arg1) 2240 $
                "The dot command does not support arguments in sh/dash. Set them as variables."
            _ -> return ()

The prop_.. lines are individual unit tests. Note in particular that:

  • Each test simply specifies whether the given check emits a warning for a snippet. The boilerplate fits on the same line
  • The test is in the same file and place as the function, so it doesn’t require any cross-referencing
  • It doubles as a doc comment that explains what the function is expected to trigger on
  • checkSourceArgs is in OO terms an unexposed private method, but no OO BS was required to "expose it for testing"

Even the parser has tests like this, where it can check whether a given function parses the given string cleanly or with warnings.

QuickCheck is better known for its ability to generate test cases for invariants, which ShellCheck makes some minimal use of, but even without that I’ve never had a better test writing experience in any previous project of any language.

On writing a parser

ShellCheck was the first real parser I ever wrote. I’ve since taken up a day job as a compiler engineer, which helps to put a lot of it into perspective.

My most important lessons would be:

  • Be careful if your parser framework makes it too easy to backtrack. Look up good parser design. I naively wrote a character based parser function for each construct like ${..}, $(..), $'..', etc, and now the parser has to backtrack a dozen times to try every possibility when it hits a $. With a tokenizer or a parser that read $ followed by {..}, (..) etc, it would have been much faster — in fact, just grouping all the $ constructs behind a lookahead decreased total checking time by 10%.
  • Consistently use a tab stop of 1 for all column counts. This is what e.g. GCC does, and it makes life easier for everyone involved. ShellCheck used Parsec’s default of 8, which has been a source of alignment bugs and unnecessary conversions ever since.
  • Record the full span and not just the start index of your tokens. Everyone loves squiggly lines under bad code. Also consider whether you want to capture comments and whitespace so you can turn the AST back into a script for autofixing. ShellCheck retrofitted end positions and emits autofixes as a series of text edits based on token spans, and it’s neither robust nor convenient.

ShellCheck’s parser has historically also been, let’s say, "pragmatic". For example, shell scripts are case sensitive, but ShellCheck accepted While in place of while for loops.

This is generally considered heresy, but it originally made sense when ShellCheck needed to be as helpful as possible on the first try for a known buggy script. Neither ShellCheck nor a human would point out that While sleep 1; do date; done has a misplaced do and done, but most linters would since While is not considered a valid start of a loop.

These days it is not as useful, since any spurious warnings about do would disappear when the user fixed the warning for While and reran ShellCheck.

It also gets in the way for advanced users who e.g. write a function called While and capitalize it that way because they don’t want it treated as a shell keyword. ShellCheck has rightly received some criticism for focusing too much on newbie mistakes at the expense of noise for advanced users. This is an active area of development.

If designed again, ShellCheck would parse more strictly according to spec, and instead make liberal use of lookaheads with pragmatic interpretations to emit warnings, even if it often resulted in a single useful warning at a time.

On writing a static analysis tool

I hadn’t really pondered, didn’t really use, and definitely hadn’t written any static analysis or linting tools before. The first versions of ShellCheck didn’t even have error codes, just plain English text befitting an IRC bot.

  • Supplement terse warnings with a wiki/web page. ShellCheck’s wiki has a page for each warning, like SC2162. It has an example of code that triggers, an example of a fix, a mention of cases where it may not apply, and has especially received a lot of praise for having an educational rationale explaining why this is worth fixing.
  • You’re not treading new ground. There are well studied algorithms for whatever you want to do. ShellCheck has some simplistic ad-hoc algorithms for e.g. variable liveness, which could and should have been implemented using robust and well known techniques.
  • If designed today with 20/20 hindsight, ShellCheck would have a plan to work with (or as) a Language Server to help with editor integrations.
  • Include simple ways to suppress warnings. Since ShellCheck was originally intended for one-shot scenarios, this was an afterthought. It was then added on a statement level where the idea was that you could put special comments in front of a function, loop, or regular command, and it would apply to the entire thing. This has been an endless source of confusion (why can’t you put it in front of a case branch?), and should have been done on a per-line level instead.
  • Give tests metadata so you can filter them. ShellCheck’s original checks were simple functions invoked for each AST node. Some of them only applied to certain shells, but would still be invoked thousands of times just to check that they don’t apply and return. Command specific checks would all duplicate and repeat the work of determining whether the current node was a command, and whether it was the command. Disabled checks were all still run, and their hard work simply filtered out afterwards. With more metadata, these could have been more intelligently applied.
  • Gather a test corpus! Examining the diff between runs on a few thousand scripts has been invaluable in evaluating the potential, true/false positive rate, and general correctness of checks.
  • Track performance. I simply added time output to the aforementioned diff, and it stopped several space leaks and quadratic explosions.

For a test corpus, I set up one script to scrape pastebin links from #bash@Freenode, and another to scrape scripts from trending GitHub projects.

The pastebin links were more helpful because they exactly represented the types of scripts that ShellCheck wanted to check. However, though they’re generally simple and posted publicly, I don’t actually have any rights to redistribute them, so I can’t really publish them to allow people to test their contributions.

The GitHub scripts are easier to redistribute since there’s provenance and semi-structured licensing terms, but they’re generally also less buggy and therefore less useful (except for finding false positives).

Today I would probably have tried parsing the Stack Exchange Data Dump instead.

Finally, ShellCheck is generally reluctant to read arbitrary files (e.g. requiring a flag -x to follow included scripts). This is obviously because it was first a hosted service on IRC and web before containerization was made simple, and not because this is in any way helpful or useful for a local linter.

On having a side project while working at a large company

I worked at Google when I started ShellCheck. They were good sports about it, let me run the project and keep the copyright, as long as I kept it entirely separate from my day job. I later joined Facebook, where the policies were the same.

Both companies independently discovered and adopted ShellCheck without my input, and the lawyers stressed the same points:

  • The company must not get, or appear to get, any special treatment because of you working there. For example, don’t prioritize bugs they find.
  • Don’t contribute to anything related to the project internally. Not even if it’s work related. Not even if it’s not. Not even on your own time.
  • If anyone assigns you a related internal task/bug, reject it and tell them they’ll have to submit a FOSS bug report.

And after discovering late in the interview process that Apple has a blanket ban on all programming related hobby projects:

  • Ask any potential new employer about their side project policy early on

On the name "ShellCheck"

I just thought it was a descriptive name with a cute pun. I had no idea that a portion of the population would consistently read "SpellCheck" no matter how many times they saw it. Sorry for the confusion!

What’s new in ShellCheck v0.7.0?

ShellCheck v0.7.0 has just been released. In addition to the usual “bug fixes and improvements”, there is a set of new features:

Autofixes

A few select warnings now come with auto-fixes. In the most straight-forward case, ShellCheck shows you what it thinks the line ought to be:

In foo line 2:
echo "File size: $(stat -c %s $1)"
                              ^-- SC2086: Double quote to prevent globbing and word splitting.

Did you mean:
echo "File size: $(stat -c %s "$1")"

To actually apply the fixes, you can use ShellCheck’s new diff output format, which outputs standard Unified Diffs that can be piped into tools like git apply and patch:

$ shellcheck -f diff foo
--- a/foo
+++ b/foo
@@ -1,2 +1,2 @@
 #!/bin/sh
-echo "File size: $(stat -c %s $1)"
+echo "File size: $(stat -c %s "$1")"

For example, to apply only SC2086 fixes to all .sh files in a project:

$ shellcheck --include=SC2086 -f diff **/*.sh | git apply
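
As for why SC2086 is worth fixing in the first place, here is a minimal sketch of what the missing quotes do. count_args is a made-up helper, and the filename is just an example containing a space:

```shell
#!/bin/sh
# Sketch: what SC2086 is protecting against. count_args is a made-up helper.
count_args() { echo "$#"; }

set -- "my file.txt"          # $1 contains a space

unquoted=$(count_args $1)     # word splitting turns $1 into two arguments
quoted=$(count_args "$1")     # quoting keeps it as one

echo "unquoted: $unquoted, quoted: $quoted"
```

With the default IFS this prints "unquoted: 2, quoted: 1", which is the difference between stat seeing one file and seeing two files that don't exist.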

Optional Checks

ShellCheck now includes a small handful of checks that are off by default. These are intended for subjective issues that a project may choose to enforce:

$ cat foo
#!/bin/sh
# shellcheck enable=require-variable-braces
name=World
echo "Hello $name"

$ shellcheck foo
In foo line 4:
echo "Hello $name"
            ^---^ SC2250: Prefer putting braces around variable references even when not strictly required.

Did you mean:
echo "Hello ${name}"

For a list of such checks, run shellcheck --list-optional

source paths

ShellCheck now allows you to specify a list of search locations for sourced scripts using a # shellcheck source-path=/my/dir directive or --source-path flag.

This is useful in several cases:

  • If all the project’s sourced files are relative to the same directory, you can now specify this directory once instead of having to add source directives everywhere.
  • The special name SCRIPTDIR can be specified in a path to refer to the location of the script being checked, allowing ShellCheck to more conveniently discover included files from the same directory. This also works for any path relative to the script’s directory, such as SCRIPTDIR/../include/
  • Absolute paths are also grounded in the source path, so by specifying source-path=/mnt/chroot, shellcheck will look for . /bin/funcs.sh in /mnt/chroot/bin/funcs.sh. This is useful when targeting a specific system, such as an embedded one.

RC files

Rather than adding directives in each file, you can now set most of the options above in a .shellcheckrc file in the project’s root directory (or your home directory). This allows you to easily apply the same options to all scripts on a per-project/directory basis.
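
As an illustration, the sketch below writes a small .shellcheckrc into a scratch directory standing in for a project root. The directive names are the ones ShellCheck documents, but the particular choices (which optional check to enable, which warning to disable) are arbitrary examples rather than recommendations.

```shell
#!/bin/sh
# Sketch: create an example .shellcheckrc. The scratch directory stands in
# for a project root, and the chosen settings are only examples.
project=$(mktemp -d) || exit 1

cat > "$project/.shellcheckrc" <<'EOF'
# Look for sourced files relative to the script being checked
source-path=SCRIPTDIR
# Turn on one of the optional checks
enable=require-variable-braces
# Silence a warning for every script in the project
disable=SC2059
EOF

cat "$project/.shellcheckrc"
```

Each line corresponds to a directive that could otherwise be given per file in a # shellcheck comment.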

Bats and shflags support

ShellCheck no longer needs any preprocessing to check Bats scripts:

$ cat test.bats
#!/usr/bin/env bats

@test "addition using bc" {
  result="$(echo 2+2 | bc)"
  [ "$result" -eq 4 ]
}

$ shellcheck test.bats && echo "Success"
Success

A bats shebang will be interpreted as “bash”, and @test statements will be correctly parsed.

ShellCheck now also recognizes DEFINE_* statements from the shflags library:

DEFINE_string 'name' 'world' 'name to say hello to' 'n'
              ^----^ SC2034: FLAGS_name appears unused. Verify use (or export if used externally).

For a more extensive list of changes, check out the ChangeLog.

Happy ShellChecking!

Tricking the tricksters with a next level fork bomb

Do not copy-paste anything from this article into your shell. You have been warned.

Some people make a cruel sport out of tricking newbies into running destructive shell commands.

Often, this takes the form of crudely obscured commands like this one, which will result in a rm -rf * being executed in the current directory, deleting everything:

$(echo cm0gLXJmICoK | base64 -d)

Years ago, I came across someone doing this, and decided to trick them back.

Now, I’m not enough of a jerk to trick anyone into deleting their files, but I’m more than willing to let wanna-be hackers fork bomb themselves.

I designed a fork bomb in such a way that even when people know it’s a destructive command, they still run it! At the risk of you doing the same, here it is:

eval $(echo "I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode)

It looks like yet another crudely obscured command, but it’s not. It does not prey on unsuspecting newbies’ tendencies to run commands they don’t understand.

Instead, it targets people who are familiar with that kind of trick, who know it’s going to be destructive, and exploits their schadenfreude and curiosity.

For the previous command, such a person would remove the surrounding $(..) to find out what a victim would have been fooled into executing:

$ echo cm0gLXJmICoK | base64 -d
rm -rf *

But when they similarly modify this command to see what horror will befall the newbie stupid enough to run it:

echo "I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode

They’ll suddenly find their system slowing to a crawl until a forced reboot! As it turns out, they were the newbie all along.

You see, the eval (…dramatic pause…) was a decoy!

In fact, the uudecode, echo and $(..) were all just part of the act. They’re purely for misdirection, and don’t serve any functional purpose.

No decoding, execution or evaluation is required for the bomb to explode. Instead it’s set off by the simple expansion, in any context, of this argument:

"I<RA('1E<W3t`rYWdl&r()(Y29j&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;=="

Even most of this string is just for show, designed to make it look more like uuencoded data. Here it is with all the arbitrary characters replaced with underscores:

"____________`_____&r()(____&r{,______}&r{,_____});r`_________"

And here it’s written more cleanly:

" `r() ( r & r ); r` "

Now it’s your bog standard fork bomb in a command expansion.


I went through a few iterations designing this trap. The first one was this:

eval $(echo 'a2Vrf3xvcml'\ZW%3t`r()(r|r);r`2'6a2VrZQo=' | base64 -d)

It has the same basic form, but several problems:

  • Base64 is pretty well known, and this clearly isn’t it
  • It’s quite obvious from the quotes that the literal string stops and starts
  • The fork bomb, r()(r|r);r really sticks out

base64 is almost entirely alphanumeric, e.g. bW9yZSBnYXJiYWdlIGhlcmUK, while uuencoded data (if you can even remember what it looks like), has a bunch of symbols that would obscure any embedded shell code: 1<V]M92!G87)B86=E(&AE<F4`. I broke up the long gibberish base64-ish strings with symbols to match.

For the quotes, I shoved it in simple double quotes and hoped no one would notice the amount of questionable characters put in an interpolated string.

For the bomb itself, I wanted to find a way to insert more gibberish, but without adding any spaces that attract the eyes. Making the string r longer would work, but the repetition would be noticeable.

The fix I ended up with was using brace expansion: foo.{jpg,png} expands to foo.jpg foo.png, and r{,foo} expands to r rfoo. This invokes r with an argument that the function ignores.
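
The expansion can be inspected on its own, without any function named r in sight. This sketch is harmless to run, assumes Bash (plain sh may not do brace expansion), and uses a made-up helper, show_args, to make the argument boundaries visible:

```shell
#!/bin/bash
# Sketch: brace expansion happens before the command runs (Bash, not plain sh).
# show_args is a made-up helper that prints each argument in brackets.
show_args() { printf '[%s]' "$@"; echo; }

with_list=$(show_args foo.{jpg,png})
with_empty=$(show_args r{,foo})

echo "$with_list"    # [foo.jpg][foo.png]
echo "$with_empty"   # [r][rfoo]
```

The empty first alternative is what turns one gibberish-looking word into the command name followed by a throwaway argument.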

The second version was this:

eval $(echo "I<RA('1E<W3t`p&r()(rofl&r{,3Rl7Ig}&r{,T31wo});r`26<F]F;==" | uudecode)

The idea here was that rofl would be executed on every fork, filling the screen with “rofl: command not found” for some extra finesse, but I figured that such a recognizable word would attract attention and further scrutiny.

In the end, I arrived at the final version, and it was quite effective. Several people involved in the noob sniping sheepishly admitted that they fell for it.

I essentially forgot about it, but other people apparently didn’t. About a year later someone asked about it on SuperUser, where you can find an even better analysis.

And now you have the backstory as well.

A shell script that deleted a database, and how ShellCheck could have helped

Summary: We examine a real world case of how an innocent shell scripting mistake caused the deletion of a production database, and how ShellCheck (a GPLv3 shell script linting and analysis tool) would have pointed out the errors and prevented the disaster.

Disclosure: I am the ShellCheck author.

The event

Here is the sad case, taken from a recent StackOverflow post:

My developer committed a huge mistake and we cannot find our mongo database anyone in the server. Rescue please!!!

He logged into the server, and saved the following shell under ~/crontab/mongod_back.sh:

#!/bin/sh
DUMP=mongodump
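OUT_DIR=/data/backup/mongod/tmp     // Temporary directory for backup files
TAR_DIR=/data/backup/mongod         // Final directory for backup files
DATE=`date +%Y_%m_%d_%H_%M_%S`      // Backup files are saved under the backup time
DB_USER=Guitang                     // Database operator
DB_PASS=qq____________              // Database operator password
DAYS=14                             // Keep the latest 14 days of backups
TAR_BAK="mongod_bak_$DATE.tar.gz"   // Naming format for backup files
cd $OUT_DIR                         // Create the folder
rm -rf $OUT_DIR/*                   // Empty the temporary directory
mkdir -p $OUT_DIR/$DATE             // Create the folder for this backup
$DUMP -d wecard -u $DB_USER -p $DB_PASS -o $OUT_DIR/$DATE  // Run the backup command
tar -zcvf $TAR_DIR/$TAR_BAK $OUT_DIR/$DATE       // Pack the backup files into the final directory
find $TAR_DIR/ -mtime +%DAYS -delete             // Delete old backups from more than 14 days ago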

And then he run ./mongod_back.sh, then there were lots of permission denied, then he did Ctrl+C. Then the server shut down automatically.

He then contacted AliCloud, the engineer connected the disk to another working server, so that he could check the disk. Then, he realized that some folders have gone, including /data/ where the mongodb is!!!

PS: he did not take snapshot of the disk before.

Essentially, it’s every engineer’s nightmare.

The post-mortem of this issue is an interesting puzzle that requires only basic shell scripting knowledge. If you’d like to give it a try, now’s the time. If you’d like some hints, here’s shellcheck’s output for the script.

The rest of this post details what happened, and how ShellCheck could have averted the disaster.

What went wrong?

The MCVE for how to ruin your week is this:

#!/bin/sh
DIR=/data/tmp    // The directory to delete
rm -rf $DIR/*    // Now delete it

The fatal error here is that // is not a comment in shell scripts. It’s a path to the root directory, equivalent to /.

On some platforms, the rm line would have been fatal by itself, because it’d boil down to rm -rf / with a few other arguments. Implementations these days often don’t allow this though. The disaster in question happened on Ubuntu, whose GNU rm would have refused:

$ rm -rf //
rm: it is dangerous to operate recursively on '//' (same as '/')
rm: use --no-preserve-root to override this failsafe

This is where the assignment comes in.

The shell treats variable assignments and commands as two sides of the same coin. Here’s the description from POSIX:

A “simple command” is a sequence of optional variable assignments and redirections, in any sequence, optionally followed by words and redirections, terminated by a control operator.

(A “simple command” is in contrast to “compound” commands, which are structures like if statements and for loops that contain one or more simple or compound commands.)

This means that var=42 and echo "Hello" are both simple commands. The former has one optional assignment and zero optional words. The latter has zero optional assignments and two optional words.

It also implies that a single simple command can contain both: var=42 echo "Hello"

To make a long spec short, assignments in a simple command will apply only to the invoked command name. If there is no command name, they apply to the current shell. The latter explains var=42 by itself, but when would you use the former?

It’s useful when you want to set a variable for a single command without affecting the rest of your shell:

$ echo "$PAGER"  # Show current pager
less

$ PAGER="head -n 5" man ascii
ASCII(7)       Linux Programmer's Manual      ASCII(7)

NAME
       ascii  -  ASCII character set encoded in octal,
       decimal, and hexadecimal

$ echo "$PAGER"  # Current pager hasn't changed
less

This is exactly what happened unintentionally in the fatal assignment. Just like how the previous example scoped PAGER to man only, this one scoped DIR to //:

$ DIR=/data/tmp    // The directory to delete
bash: //: Is a directory

$ echo "$DIR"  # The variable is unset
(no output)

This meant that rm -rf $DIR/* became rm -rf /*, and therefore bypassed the check that is in place for rm -rf /.

(Why can’t or won’t rm simply refuse to delete /* too? Because it never sees /*: the shell expands it first, so rm sees /bin /boot /dev /data .... While rm could obviously refuse to remove first level directories as well, this starts getting in the way of legitimate usage – a big sin in the Unix philosophy)
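
The whole chain can be replayed harmlessly. In this sketch, true stands in for // so that nothing is executed, and the rm command is only assembled as a string and printed:

```shell
#!/bin/sh
# Sketch: a harmless replay of the fatal assignment. `true` stands in for `//`
# and the rm command is only built as a string, never run.
unset DIR
DIR=/data/tmp    true The directory to delete   # assignment scoped to `true` only

expanded="rm -rf $DIR/*"    # double quotes keep the glob from being expanded
echo "$expanded"
```

This prints rm -rf /* because DIR was only ever set for the stand-in command, and was never set in the shell itself.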

How ShellCheck could have helped

Here’s the output from this minimized snippet (see online):

$ shellcheck myscript

In myscript line 2:
DIR=/data/tmp    // The directory to delete
                 ^-- SC1127: Was this intended as a comment? Use # in sh.


In myscript line 3:
rm -rf $DIR/*    // Now delete it
       ^----^ SC2115: Use "${var:?}" to ensure this never expands to /* .
       ^--^ SC2086: Double quote to prevent globbing and word splitting.
                 ^-- SC2114: Warning: deletes a system directory.

Two issues have already been discussed, and would have averted this disaster:

  • ShellCheck noticed that the first // was likely intended as a comment (wiki: SC1127).
  • ShellCheck pointed out that the second // would target a system directory (wiki: SC2114).

The third is a general defensive technique which would also have prevented this catastrophic rm independently of the two other fixes:

  • ShellCheck suggested using rm -rf ${DIR:?}/* to abort execution if the variable for any reason is empty or unset (wiki: SC2115).

This would mitigate the effect of a whole slew of pitfalls that can leave a variable empty, including echo /tmp | read DIR (subshells), DIR= /tmp (bad spacing) and DIR=$(echo /tmp) (potential fork/command failures).
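
Here is a minimal sketch of that guard in action. The guarded function and the list of test cases are hypothetical, echo stands in for rm so nothing is deleted, and DIR=$(false) is my stand-in for a command substitution that fails:

```shell
#!/bin/bash
# Hypothetical demo: ${DIR:?} aborts before the command runs when DIR is
# empty or unset. echo stands in for rm, so nothing is deleted.
guarded() (
  # Subshell body, so the abort doesn't take down the calling shell.
  unset DIR
  eval "$1"                              # apply one of the pitfalls
  echo "would run: rm -rf ${DIR:?}/*"    # never reached if DIR is empty
)

for pitfall in 'echo /tmp | read DIR' 'DIR= /tmp' 'DIR=$(false)' 'DIR=/tmp'; do
  if out=$(guarded "$pitfall" 2>/dev/null); then
    echo "$pitfall => $out"
  else
    echo "$pitfall => aborted"
  fi
done
```

Only the last case, a correct assignment, gets as far as the echo. The other three leave DIR empty or unset, and the expansion stops the subshell before the command line is even built.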

Conclusion

Shell scripts are really convenient, but also have a large number of potential pitfalls. Many issues that would be simple, fail-fast syntax errors in other languages would instead cause a script to misbehave in confusing, annoying, or catastrophic ways. Many examples can be found in the Wooledge Bash Pitfalls list, or ShellCheck’s own gallery of bad code.

Since tooling exists, why not take advantage? Even if (or especially when!) you rarely write shell scripts, you can install shellcheck from your package manager, along with a suitable editor plugin like Flycheck (Emacs) or Syntastic (Vim), and just forget about it.

The next time you’re writing a script, your editor will show warnings and suggestions automatically. Whether or not you want to fix the more pedantic style issues, it may be worth looking at any unexpected errors and warnings. It might just save your database.

So what exactly is -ffunction-sections and how does it reduce binary size?

If you’d like a more up-to-date version of ShellCheck than what Raspbian provides, you can build your own on a Raspberry Pi Zero in a little over 21 hours.

Alternatively, as of last week, you can also download RPi compatible, statically linked armv6hf binaries of every new commit and stable release.

It’s statically linked — i.e. the executable has all its library dependencies built in — so you can expect it to be pretty big. However, I didn’t expect it to be 67MB:

build@d1044ff3bf67:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 66658032 Jul 14 16:04 shellcheck

This is for a tool intended to run on devices with 512MiB RAM. strip helps shed a lot of that weight, and the post-stripped number is the one we’ll use from now on, but 36MB is still more than I expected, especially given that the x86_64 build is 23MB.

build@d1044ff3bf67:/mnt/shellcheck# strip --strip-all shellcheck
build@d1044ff3bf67:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 35951068 Jul 14 16:22 shellcheck

So now what? Optimize for size? Here’s ghc -optlo-Os to enable LLVM opt size optimizations, including a complete three hour Qemu emulated rebuild of all dependencies:

build@31ef6588fdf1:/mnt/shellcheck# ls -l shellcheck
-rwxr-xr-x 1 build build 32051676 Jul 14 22:38 shellcheck

Welp, that’s not nearly enough.

The real problem is that we’re linking in both C and Haskell dependencies, from the JSON formatters and Regex libraries to bignum implementations and the Haskell runtime itself. These have tons of functionality that ShellCheck doesn’t use, but which is still included as part of the package.

Fortunately, GCC and GHC allow eliminating this kind of dead code through function sections. Let’s look at how that works, and why dead code can’t just be eliminated as a matter of course:

An ELF binary contains a lot of different things, each stored in a section. It can have any number of these sections, each of which has a pile of attributes including a name:

  • .text stores executable code
  • .data stores global variable values
  • .symtab stores the symbol table
  • Ever wondered where compilers embed debug info? Sections.
  • Exception unwinding data, compiler version or build IDs? Sections.

This is how strip is able to safely and efficiently drop so much data: if a section has been deemed unnecessary, it’s simple and straightforward to drop it without affecting the rest of the executable.

Let’s have a look at some real data. Here’s a simple foo.c:

int foo() { return 42; }
int bar() { return foo(); }

We can compile it with gcc -c foo.c -o foo.o and examine the sections:

$ readelf -a foo.o
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:        ELF32
  Data:         2's complement, little endian
  Version:      1 (current)
  OS/ABI:       UNIX - System V
  ABI Version:  0
  Type:         REL (Relocatable file)
  Machine:      ARM
[..]

Section Headers:
  [Nr] Name       Type      Addr   Off    Size   ES Flg Lk Inf Al
  [ 0]            NULL      000000 000000 000000 00      0   0  0
  [ 1] .text      PROGBITS  000000 000034 000034 00  AX  0   0  4
  [ 2] .rel.text  REL       000000 000190 000008 08   I  8   1  4
  [ 3] .data      PROGBITS  000000 000068 000000 00  WA  0   0  1
  [ 4] .bss       NOBITS    000000 000068 000000 00  WA  0   0  1
  [..]

Symbol table '.symtab' contains 11 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
   [..]
     9: 00000000    28 FUNC    GLOBAL DEFAULT    1 foo
    10: 0000001c    24 FUNC    GLOBAL DEFAULT    1 bar

There’s tons more info not included here, and it’s an interesting read in its own right. Anyways, both our functions live in the .text section. We can see this from the symbol table’s Ndx column which says section 1, corresponding to .text. We can also see it in the disassembly:

$ objdump -d foo.o
foo.o:     file format elf32-littlearm

Disassembly of section .text:
00000000 <foo>:
   0:   e52db004   push    {fp}
   4:   e28db000   add     fp, sp, #0
   8:   e3a0302a   mov     r3, #42 ; 0x2a
   c:   e1a00003   mov     r0, r3
  10:   e28bd000   add     sp, fp, #0
  14:   e49db004   pop     {fp}
  18:   e12fff1e   bx      lr

0000001c <bar>:
  1c:   e92d4800   push    {fp, lr}
  20:   e28db004   add     fp, sp, #4
  24:   ebfffffe   bl      0 <foo>
  28:   e1a03000   mov     r3, r0
  2c:   e1a00003   mov     r0, r3
  30:   e8bd8800   pop     {fp, pc}

Now let’s say that the only library function we use is foo, and we want bar removed from the final binary. This is tricky, because you can’t just modify a .text section by slicing things out of it. There are offsets, addresses and cross-dependencies compiled into the code, and any shifts would mean trying to patch that all up. If only it was as easy as when strip removed whole sections…

This is where gcc -ffunction-sections and ghc -split-sections come in. Let’s recompile our file with gcc -ffunction-sections foo.c -c -o foo.o:

$ readelf -a foo.o
[..]
Section Headers:
  [Nr] Name          Type      Addr  Off  Size ES Flg Lk Inf Al
  [ 0]               NULL      00000 0000 0000 00      0   0  0
  [ 1] .text         PROGBITS  00000 0034 0000 00  AX  0   0  1
  [ 2] .data         PROGBITS  00000 0034 0000 00  WA  0   0  1
  [ 3] .bss          NOBITS    00000 0034 0000 00  WA  0   0  1
  [ 4] .text.foo     PROGBITS  00000 0034 001c 00  AX  0   0  4
  [ 5] .text.bar     PROGBITS  00000 0050 001c 00  AX  0   0  4
  [ 6] .rel.text.bar REL       00000 01c0 0008 08   I 10   5  4
  [..]

Symbol table '.symtab' contains 14 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
   [..]
    12: 00000000    28 FUNC    GLOBAL DEFAULT    4 foo
    13: 00000000    28 FUNC    GLOBAL DEFAULT    5 bar

Look at that! Each function now has its very own section.

This means that a linker can go through and find all the sections that contain symbols we need, and drop the rest. We can enable it with the aptly named ld flag --gc-sections. You can pass that flag to ld via gcc using gcc -Wl,--gc-sections. And you can pass that whole thing to gcc via ghc using ghc -optc-Wl,--gc-sections.
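
As a rough end-to-end check of the gcc half, here is a sketch that builds the same two functions plus a main that only calls foo, once normally and once with function sections and section garbage collection. It assumes gcc, GNU ld and nm on Linux; demo.c and the has_bar helper are names I made up:

```shell
#!/bin/sh
# Hypothetical demo: compare symbols with and without section garbage collection.
# Assumes gcc, GNU ld and nm (binutils) on Linux; demo.c is a made-up file name.
work=$(mktemp -d)
cat > "$work/demo.c" <<'EOF'
int foo(void) { return 42; }
int bar(void) { return foo(); }      /* never referenced from main */
int main(void) { return foo() == 42 ? 0 : 1; }
EOF

# Plain build: bar lives in .text along with everything else.
gcc -O0 "$work/demo.c" -o "$work/plain"

# One section per function, then let the linker drop unreferenced sections.
gcc -O0 -ffunction-sections -Wl,--gc-sections "$work/demo.c" -o "$work/gc"

# Report whether a global function symbol named bar survived linking.
has_bar() { nm "$1" | grep -q ' T bar$' && echo yes || echo no; }
echo "plain: bar present? $(has_bar "$work/plain")"
echo "gc:    bar present? $(has_bar "$work/gc")"
```

If the linker collected .text.bar, the second line reports no while the first still reports yes. The -O0 is only there to keep the compiler from optimizing the functions away on its own, so that any difference comes from the linker.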

I enabled all of this in my builder’s .cabal/config:

program-default-options
  gcc-options: -Os -Wl,--gc-sections -ffunction-sections -fdata-sections
  ghc-options: -optc-Os -optlo-Os -split-sections

With this in place, the ShellCheck binary became a mere 14.5MB:

-rw-r--r-- 1 build build 14503356 Jul 15 10:01 shellcheck

That’s less than half the size we started out with. I’ve since applied the same flags to the x86_64 build, which brought it down from 23MB to 7MB. Snappier downloads and installs for all!


For anyone interested in compiling Haskell for armv6hf on x86_64, I spent weeks trying to get cross-compilation going, but even with many hacks I was only able to cross-compile armv7. In the end I gave up and took the same approach as with the Windows build blog post: a Docker image runs the Raspbian armv6 userland in Qemu user emulation mode.

I didn’t even have to set up Qemu. There’s tooling from Resin.io for building ARM Docker containers for IoT purposes. ShellCheck (ab)uses this to run emulated GHC and cabal. Everything Just Works, if slowly.

The Dockerfile is available on GitHub as koalaman/armv6hf-builder.