Why Bash Is Like That: Rewrite hacks

Bash can seem pretty random and weird at times, but most of what people see as quirks have very logical (if not very good) explanations behind them. This series of posts looks at some of them.

Let’s say you wanted to enforce a policy in which no files on the system could contain swearing. How would you write a script that checks it? Let’s use the word “damn”, and let’s write a script “checklanguage” that checks whether a file contains that word.

Our first version might be:

#!/usr/bin/env bash
grep -q "damn" "$@" 

The problem with this is that it triggers on itself: ./checklanguage checklanguage returns true. How can we write the script in such a way that it reliably detects the word, but doesn’t detect itself? (Think about it for a second).

There are many ways of doing this: a="da"; b="mn"; grep "$a$b", grep "da""mn", grep da\mn. All of these check for the four characters d-a-m-n in sequence, but none of them contain the sequence themselves. These methods rely on two things being A. identical in one context (shell script) and B. different in another (plaintext).
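As a quick sanity check, here is a sketch of the first variant being run against a file that contains the word, and then against itself. The temp directory and the rude.txt file name are made up for the demo, and the demo takes care not to spell out the word either:

```shell
#!/usr/bin/env bash
# Hypothetical demo: build a checker whose own text never contains the word.
tmpdir=$(mktemp -d)
checker="$tmpdir/checklanguage"

cat > "$checker" <<'EOF'
#!/usr/bin/env bash
# The pattern is assembled from two halves at run time.
a="da"; b="mn"
grep -q "$a$b" "$@"
EOF
chmod +x "$checker"

# Write a test file containing the word, again without spelling it out here.
printf 'well, da%s\n' 'mn it' > "$tmpdir/rude.txt"

"$checker" "$tmpdir/rude.txt" && rude_result=match || rude_result=clean
"$checker" "$checker" && self_result=match || self_result=clean
echo "rude.txt: $rude_result, checker itself: $self_result"
rm -rf "$tmpdir"
```

The checker flags rude.txt but comes up clean when pointed at its own source.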

This type of trick is the basis of three common command line hacks:

Finding processes from ps, while excluding the grep that does the filtering.

If we do a simple ps ax | grep processname, we might get output like this:

$ ps ax | grep processname
13003 pts/2    S      0:00 /bin/bash ./processname
13496 pts/4    R+     0:00 grep --color=auto processname

How do we get the same list, but without the grep process? You’ll see people wrapping the first character in square brackets:

$ ps ax | grep "[p]rocessname"
13003 pts/2    S      0:00 /bin/bash ./processname

In this case, the regex “[p]rocessname” matches exactly the same strings as the regex “processname”, but since they’re written differently, the latter matches itself while the former doesn’t. This means that the grep won’t match itself, and we only get the process we’re interested in (this job is better done by pgrep).

There is no syntax rule that says “if the first character is enclosed in square brackets, grep shall ignore itself in ps output”.

It’s just a logical side effect of rewriting the regex to work the same but not match itself. We could have used grep -E 'process()name' or grep -E 'proces{2}name' instead.
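You can see the effect without involving ps at all. This sketch feeds grep two made-up process listings (the pids and columns are invented), one for each way of writing the command:

```shell
# Two made-up ps listings: one for each way of writing the grep command.
naive_listing='13003 pts/2 S  0:00 /bin/bash ./processname
13496 pts/4 R+ 0:00 grep processname'
bracket_listing='13003 pts/2 S  0:00 /bin/bash ./processname
13497 pts/4 R+ 0:00 grep [p]rocessname'

# Count how many lines each pattern matches in "its own" listing.
naive_count=$(printf '%s\n' "$naive_listing" | grep -c 'processname')
bracket_count=$(printf '%s\n' "$bracket_listing" | grep -c '[p]rocessname')
echo "naive: $naive_count, bracketed: $bracket_count"
```

The plain pattern matches both lines, including the grep itself. The bracketed one only matches the real process, because the text "[p]rocessname" on the grep line has a "]" where the regex wants an "r".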

Running commands instead of aliases

Maybe you’re sick of Debian’s weird perl rename, and you aliased it to rename.ul instead.

$ rename -v .htm .html *
`foo.htm' -> `foo.html'

Yay, that’s way easier than writing regex! But what if we need to use the unaliased rename?

$ rename -v 's/([1-9])x([0-9]*)/S0$1E$2/' *
rename.ul: not enough arguments

Instead, you’ll see people prefixing the command with a backslash:

$ \rename -v 's/([1-9])x([0-9]*)/S0$1E$2/' *
Foo_1x20.mkv renamed as Foo_S01E20.mkv

Shell aliases trigger when a command starts with a word. However, if the command starts with something that expands into a word, alias expansion does not apply. This allows us to use e.g. \ls or \git to run the command instead of the alias.

There is no syntax rule that says that “if a command is preceded by a backslash, alias expansion is ignored”.

It’s just a logical side effect of rewriting the command to work the same, but not start with a literal token that the shell will recognize as an alias. We could also have used l\s or 'ls'.
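Here is a small sketch that shows all three spellings side by side. It uses a throwaway alias for echo inside a child bash, and it assumes shopt -s expand_aliases is needed because the child is non-interactive (Bash only expands aliases by default in interactive shells):

```shell
# Run a tiny script in a child bash so the alias doesn't leak anywhere.
alias_demo=$(bash <<'EOF'
shopt -s expand_aliases
alias echo='echo aliased:'
echo one      # literal word: the alias applies
\echo two     # backslash: no longer a literal alias name
'echo' three  # quoting has the same effect
EOF
)
printf '%s\n' "$alias_demo"
```

Only the first line comes out with the "aliased:" prefix. The other two run the plain echo.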

Deleting files starting with a dash

How would you go about deleting a file that starts with a dash?

$ rm -v -file
rm: invalid option -- 'l'

Instead, you’ll see people prefixing the filename with ./:

$ rm -v ./-file
removed `./-file'

Most commands will interpret any argument that starts with a dash as an option. However, to the file system, -file and ./-file mean exactly the same thing.

There is no syntax rule that says that “if an argument starts with ./, it shall be interpreted as a filename and not an option”.

It’s just a logical side effect of rewriting a filename to refer to the same file, but start with a different character. We could have used rm /home/me/-file or rm ../me/-file instead.
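If you want to try this without risking a real file, here is a sketch that does it in a scratch directory (created with mktemp, purely for the demo):

```shell
tmpdir=$(mktemp -d)
orig_dir=$PWD
cd "$tmpdir" || exit 1

touch ./-file

# Same file, two spellings. The first one is parsed as a bundle of options.
if rm -file 2>/dev/null; then plain=removed; else plain=rejected; fi
if rm ./-file; then prefixed=removed; else prefixed=rejected; fi
if [ -e ./-file ]; then still_there=yes; else still_there=no; fi

cd "$orig_dir" && rm -rf "$tmpdir"
echo "plain: $plain, prefixed: $prefixed, still there: $still_there"
```

The first attempt is rejected as an invalid option, while the second removes the file.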

Homework: What do you tell someone who thinks that ./myscript is a perfect example of how weird UNIX is? Why would anyone design a system where the run command is “./” instead of “run”?

ShellCheck: shell script analysis

Shell scripting is notoriously full of pitfalls, unintuitive behavior and poor error messages. Here are some things you might have experienced:

  • find -exec fails on commands that are perfectly valid
  • 0==1 is apparently true
  • Comparisons are always false, and write files while failing
  • Variable values are available inside loops, but reset afterwards
  • Looping over filenames with spaces fails, and quoting doesn’t help
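To make one of these concrete: the second bullet is presumably about leaving out the spaces in a test. A minimal sketch of that pitfall, using [ ] with made-up variable names for the results:

```shell
# One word "0==1": [ only checks that the string is non-empty.
if [ 0==1 ]; then squashed=true; else squashed=false; fi

# Three words: an actual comparison of 0 and 1.
if [ 0 = 1 ]; then spaced=true; else spaced=false; fi

echo "squashed: $squashed, spaced: $spaced"
```

Without the spaces the test succeeds, since all it sees is a single non-empty string.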


ShellCheck is my latest project. It will check shell scripts for all of the above, and also tries to give helpful tips and suggestions for otherwise working ones. You can paste your script and have it checked online, or you can download it and run it locally.

Other things it checks for include reading from and redirecting to a file in the same pipeline, useless uses of cat, apparent variable use that won’t expand, too much or too little quoting in [[ ]], and not quoting globs passed to find. Instead of just saying “syntax error near unexpected token `fi'”, it also points to the relevant if statement and suggests that you might be missing a ‘then’.

It’s still in the early stages, but has now reached the point where it can be useful. The online version has a feedback button (in the top right of your annotated script), so feel free to try it out and submit suggestions!

Why Bash is like that: Subshells

# I run this script, but afterwards my PATH and current dir haven't changed!

export PATH=$PATH:/opt/local/bin
cd /opt/games/

or more interestingly

# Why does this always say 0? 
cat file | while read line; do (( n++ )); done
echo $n

In the first case, you can add an echo "Path is now $PATH", and see the expected path. In the latter case, you can put an echo $n in the loop, and it will count up as you’d expect, but at the end you’ll still be left with 0.

To make things even more interesting, here are the effects of running these two examples (or equivalents) in different shells:

          set in script   set in pipeline
Bash      No effect       No effect
Ksh/Zsh   No effect       Works
cmd.exe   Works           No effect

What we’re experiencing are subshells, and different shells have different policies on what runs in subshells.

Environment variables, as well as the current directory, are only inherited parent-to-child. Changes to a child’s environment are not reflected in the parent. Any time a shell forks, changes done in the forked process are confined to that process and its children.

In Unix, all normal shells will fork to execute other shell scripts, so setting PATH or cd’ing in a script will never have an effect after the command is done (instead, use "source file" aka ". file" to read and execute the commands without forking).
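The difference between running and sourcing is easy to see with a throwaway script. In this sketch, setvar.sh and demo_var are made-up names:

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/setvar.sh" <<'EOF'
demo_var="set by script"
EOF

unset demo_var

# Executed in a child process: the assignment dies with the child.
bash "$tmpdir/setvar.sh"
after_exec=${demo_var-unset}

# Sourced: the commands run in the current shell.
. "$tmpdir/setvar.sh"
after_source=${demo_var-unset}

rm -rf "$tmpdir"
echo "after running: $after_exec, after sourcing: $after_source"
```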

However, shells can differ in when subshells are invoked. In Bash, all elements in a pipeline will run in a subshell. In Ksh and Zsh, all except the last will run in a subshell. POSIX leaves it undefined.

This means that echo "2 + 3" | bc | read sum will work in Ksh and Zsh, but fail to set the variable sum in Bash.

To work around this in Bash, you can usually use redirection and process substitution instead:

read sum < <(echo "2 + 3" | bc)
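The same workaround applies to the counting loop from the start of this post. Here is a sketch comparing the two forms, assuming Bash with its default options (in particular, without the lastpipe option set). The variable names are made up:

```shell
# Pipeline: in Bash, the while loop runs in a subshell.
count_pipe=0
printf 'a\nb\nc\n' | while read -r line; do count_pipe=$((count_pipe + 1)); done

# Redirection from process substitution: the loop runs in the current shell.
count_redirect=0
while read -r line; do count_redirect=$((count_redirect + 1)); done < <(printf 'a\nb\nc\n')

echo "pipe: $count_pipe, redirect: $count_redirect"
```

The piped loop counts to 3 in its subshell and then throws the result away, while the redirected loop leaves the count behind.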

So, where do we find subshells? Here is a list of commands that in some way fail to set foo=bar for subsequent commands (note that all the examples set it in some subshell, and can use it until the subshell ends):

# Executing other programs or scripts
foo=bar ./something

# Anywhere in a pipeline in Bash
true | foo=bar | true

# In any command that executes new shells
awk '{ system("foo=bar") }'
find . -exec bash -c 'foo=bar' \;

# In backgrounded commands and coprocs:
foo=bar &
coproc foo=bar

# In command expansion
true "$(foo=bar)"

# In process substitution
true < <(foo=bar)

# In commands explicitly subshelled with ()
( foo=bar )

and probably some more that I'm forgetting.

Trying to set a variable, option or working dir in any of these contexts will result in the changes not being visible for following commands.

Knowing this, we can use it to our advantage:

# cd to each dir and run make
for dir in */; do ( cd "$dir" && make ); done

# Compare to the more fragile
for dir in */; do cd "$dir"; make; cd ..; done

# mess with important variables
fields=(a b c); ( IFS=':'; echo "${fields[*]}" )

# Compare to the cumbersome
fields=(a b c); oldIFS=$IFS; IFS=':'; echo "${fields[*]}"; IFS=$oldIFS

# Limit scope of options
( set -e; foo; bar; baz; ) 
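A sketch to confirm that nothing leaks out of these subshells (the variable names are made up). Note the quotes around ${fields[*]}: they are what makes Bash join the elements with the first character of IFS.

```shell
fields=(a b c)
saved_ifs=$IFS
before=$PWD

joined=$( IFS=':'; echo "${fields[*]}" )   # $( ) is a subshell too
( cd / && true )                           # the cd stays inside the parentheses

if [ "$IFS" = "$saved_ifs" ]; then ifs_state=unchanged; else ifs_state=changed; fi
if [ "$PWD" = "$before" ]; then dir_state=unchanged; else dir_state=changed; fi
echo "$joined, IFS $ifs_state, dir $dir_state"
```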

Why Bash is like that: Signal propagation

How do I simulate pressing Ctrl-C when running this in a script:
while true; do echo sleeping; sleep 30; done

Are you thinking “SIGINT, duh!”? Hold your horses!

I tried kill -INT pid, but it doesn't work the same:

Ctrl-C    kills the sleep and the loop
SIGINTing the shell does nothing (but only in scripts: see Errata)
SIGINTing sleep makes the loop continue with the next iteration

HOWEVER, if I run the script in the background and kill -INT %1
instead of kill -INT pid, THEN it works :O

Why does Ctrl-C terminate the loop, while SIGINT doesn’t?

Additionally, if I run the same loop with ping or top instead of sleep,
Ctrl-C doesn't terminate the loop either!

Yeah. Well… Yeah…

This behaviour is due to an often overlooked feature in UNIX: process groups. These are important for getting terminals and shells to work the way they do.

A process group is exactly what it sounds like: a group of processes. Each group has a leader, which is the process that created it using setpgid(2) or setsid(2). The leader’s pid is also the process group id. Child processes are in the same group as their parent by default.

Terminals keep track of the foreground process group (set by the shell using tcsetpgrp(3)). When receiving a Ctrl-C, they send the SIGINT to the entire foreground group. This means that all members of the group will receive SIGINT, not just the immediate process.

kill -INT %1 sends the signal to the job’s process group, not the backgrounded pid! This explains why it works like Ctrl-C.

You can do the same thing with kill -INT -pgrpid. Since the process group id is the same as the pid of the process group leader, you can signal the whole group by passing that pid with a minus in front.

But why do you have to kill both?

When the shell is interrupted, it will wait for the running command to exit. If this child’s status indicates it exited abnormally due to that signal, the shell cleans up, removes its signal handler, and kills itself again to trigger the OS default action (abnormal exit). Alternatively, it runs the script’s signal handler as set with trap, and continues.

If the shell is interrupted and the child’s status says it exited normally, then Bash assumes the child handled the signal and did something useful, so it continues executing. Ping and top both trap SIGINT and exit normally, which is why Ctrl-C doesn’t kill the loop when calling them.

This also explains why interrupting just the shell does nothing: the child exits normally, so the shell thinks the child handled the signal, though in reality it was never received.

Finally, if the shell isn’t interrupted and a child exits, Bash just carries on regardless of whether the child died abnormally or not. This is why interrupting the sleep just continues with the loop.

In case one would like to handle such cases, Bash sets the exit code to 128+signal when the process exits abnormally, so interrupting sleep with SIGINT would give the exit code 130 (kill -l lists the signal values).
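Here is a sketch of reading that exit code in a script. It uses SIGTERM rather than SIGINT, since a non-interactive shell starts background commands with SIGINT ignored, so the SIGINT version would not kill the sleep in this setup. The arithmetic is the same, with 15 in place of 2:

```shell
sleep 30 &
pid=$!
kill -TERM "$pid"

# wait returns the child's status: 128 + signal number if it was killed.
status=0
wait "$pid" || status=$?

# kill -l can translate such a status back into a signal name.
signal_name=$(kill -l "$status")
echo "exit status $status means SIG$signal_name"
```

A script that wants the loop to stop when its child is killed could check for these values after each command.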

Bonus problem:

I have this C app, testpg:
#include <unistd.h>
int main() {
    setsid();
    return sleep(10);
}

I run bash -c './testpg' and press Ctrl-C. The app is killed.
Shouldn't testpg be excluded from SIGINT, since it used setsid?

A quick strace unravels this mystery: with a single command to execute, bash execve’s it directly — a little optimization trick. Since the pid is the same and already had its own process group, creating a new one doesn’t have any effect.

This trick can’t be used if there are more commands, so bash -c './testpg; true' can’t be killed with Ctrl-C.


Wait, I started a loop in one terminal and killed the shell in another. 
The loop exited!

Yes it did! This does not apply to interactive shells, which have different ways of handling signals. When job control is enabled (running interactively, or when running a script with bash -m), the shell will die when SIGINTed.

Here’s the description from the bash source code, jobs.c:2429:

  /* Ignore interrupts while waiting for a job run without job control
     to finish.  We don't want the shell to exit if an interrupt is
     received, only if one of the jobs run is killed via SIGINT. 

Why Bash is like that: suid

Why can't bash scripts be SUID?

Bash scripts can’t run with the suid bit set. First of all, Linux doesn’t allow any scripts to be setuid, though some other OSes do. Second, bash will detect being run as setuid, and immediately drop the privileges.

This is because shell script security is extremely dependent on the environment, much more so than regular C apps.

Take this script, for example, addmaildomain:

[[ $1 ]] || { man -P cat $0; exit 1; }

if grep -q "^$(whoami)\$" /etc/accesslist
then
    echo "$1" > /etc/mail/local-host-names
else
    echo "You don't have permissions to add hostnames"
fi

The intention is to allow users in /etc/accesslist to run addmaildomain example.com to write new names to local-host-names, the file which defines which domains sendmail should accept mail for.

Let’s imagine it runs as suid root. What can we do to abuse it?

We can start by setting the path:

echo "rm -rf /" > ~/hax/grep && chmod a+x ~/hax/grep
PATH=~/hax addmaildomain

Now the script will run our grep instead of the system grep, and we have full root access.
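The mechanism is easy to try out harmlessly. This sketch swaps the destructive payload for an echo and uses a temp directory in place of ~/hax. Both are stand-ins for the demo, and nothing here runs with elevated privileges:

```shell
tmpdir=$(mktemp -d)

# A harmless impostor named grep.
cat > "$tmpdir/grep" <<'EOF'
#!/bin/sh
echo "fake grep ran instead"
EOF
chmod +x "$tmpdir/grep"

# Any script that calls a bare "grep" now picks up ours first.
hijacked=$(PATH="$tmpdir:$PATH" sh -c 'grep -q root /etc/passwd')

rm -rf "$tmpdir"
echo "$hijacked"
```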

Let’s assume the author was aware of that, and had set PATH=/bin:/usr/bin as the first line in the script. What can we do now?

We can override a library used by grep:

gcc -shared -o libc.so.6 myEvilLib.c
LD_LIBRARY_PATH=. addmaildomain

When grep is invoked, it’ll link with our library and run our evil code.

Ok, so let’s say LD_LIBRARY_PATH is closed up.

If the shell is statically linked, we can set LD_TRACE_LOADED_OBJECTS=true. This will cause dynamically linked executables to print out a list of library dependencies and return true. This would cause our grep to always return true, subverting the test. The rest is builtin and wouldn’t be affected.

Even if the shell is statically compiled, all variables starting with LD_* will typically be stripped or ignored for suid executables anyway.

There is a delay between the kernel starting the interpreter, and the interpreter opening the file. We can try to race it:

while true
do
    ln /usr/bin/addmaildomain foo
    nice -n 20 ./foo &
    echo 'rm -rf /' > foo
done

But let’s assume the OS uses a /dev/fd/* mechanism for passing a fd, instead of passing the file name.

We can rename the script to confuse the interpreter:

ln /usr/bin/addmaildomain ./-i

Now we’ve created a link, which retains suid, and named it “-i”. When running it, the interpreter will run as “/bin/sh -i”, giving us an interactive shell.

So let’s assume we actually had “#!/bin/sh --” to prevent the script from being interpreted as an option.

If we don’t know how to use the command, it helpfully lists info from the man page for us. We can compile a C app that executes the script with “$0” containing “-P /hax/evil ls”, and then man will execute our evil program instead of cat.

So let’s say “$0” is quoted. We can still set MANOPT=-H/hax/evil.

Several of these attacks were based on the inclusion of ‘man’. Is this a particularly vulnerable command?

Perhaps a bit, but a whole lot of apps can be affected by the environment in more and less dramatic ways.

  • POSIXLY_CORRECT can make some apps fail or change their output
  • LANG/LC_ALL can thwart interpretation of output
  • LC_CTYPE and LC_COLLATE can modify string comparisons
  • Some apps rely on HOME and USER
  • Various runtimes have their own paths, like RUBYLIB, LUA_PATH and PYTHONPATH
  • Many utilities have variables for adding default options, like RUBYOPT and MANOPT
  • Tools invoke EDITOR, VISUAL and PAGER under various circumstances
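To see how little it takes for one of these variables to reach a child process, here is a sketch that compares a normal child with one started from an emptied environment. The MANOPT value is borrowed from the attack above and is never acted on here. env -i is only roughly the kind of cleanup a tool like sudo does for you, not a complete defense:

```shell
export MANOPT='-H/hax/evil'    # just a marker value for the demo

# A normal child inherits it without being asked.
leaked=$(sh -c 'echo "${MANOPT-unset}"')

# A child started from an empty environment does not.
scrubbed=$(env -i PATH=/bin:/usr/bin sh -c 'echo "${MANOPT-unset}"')

unset MANOPT
echo "inherited: $leaked, after env -i: $scrubbed"
```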

So yes, it’s better not to write suid shell scripts. Sudo is better than you at running commands safely.

Do remember that a script can invoke itself with sudo, if need be, for a simulated suid feel.

So wait, can’t perl scripts be suid?

They can indeed, but there the interpreter will run as the normal user and detect that the file is suid. It will then run a suid interpreter to deal with it.

Why Bash is like that: Builtin or not

# Why don't the options in "man time" work?
time -f %w myapp

Short answer: ‘time’ runs the builtin version, ‘man time’ shows the external version

time is a builtin in the shell, as well as an external command (this also goes for kill, pwd, and test). man time shows info about the external command, while help time shows the internal one.

To run the external version, one can use command time or /usr/bin/time or just \time.

The reason why time is built in is so timing pipelines will work properly. time true | sleep 10 would say 0 seconds with an external command (which can’t know what it’s being piped into), while the internal version can say 10 seconds since it knows about the whole pipeline.

POSIX leaves the behaviour of time a | b undefined.

# This finds the full path to ls. Why isn't there a 'man type'?
type -P ls

type is a bash builtin, not an external command. This allows it to take shell functions and aliases into account, something whereis can’t.

Builtins are documented in man bash, or more conveniently, “help type” (help is also a builtin).
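A quick way to ask Bash what it will actually run is type -t, sketched below with made-up variable names. Bash itself reports time as a shell keyword rather than a regular builtin, which is what lets it see the whole pipeline:

```shell
time_kind=$(type -t time)          # how Bash classifies "time"
pwd_kind=$(type -t pwd)            # how Bash classifies "pwd"
pwd_path=$(type -P pwd) || pwd_path=   # external binary on PATH, if there is one

echo "time: $time_kind, pwd: $pwd_kind (external copy: ${pwd_path:-none})"
```

type -a name goes one step further and lists every definition of a name in the order Bash would consider them.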