Pattern matching with Bash (not grep)

Pattern matching, whether on file names or variable contents, is something Bash can do faster and more accurately by itself than with grep. This post tersely describes some cases where Bash’s own pattern matching can help, by being faster, easier or better.

Simple substring search on variables

# Check if a variable contains 'foo'. Just to warm up.

# Works
if echo "$var" | grep -q foo
if [[ "$(echo "$var" | grep foo)" != "" ]]

# Easier and faster 
if [[ $var == *foo* ]] 

The latter runs several hundred times faster by saving two forks (good to know when looping), and the code is cleaner and clearer.
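
A rough way to see the difference yourself (numbers will vary by system; the string and the loop count are arbitrary):

var="some string with foo in it"
time for i in {1..1000}; do [[ $var == *foo* ]]; done
time for i in {1..1000}; do echo "$var" | grep -q foo; done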

Mixed pattern/fixed string search on variables

This is a less common but more interesting case.

# Check if /usr/bin overrides our install dir

# Mostly works (Can fail if $installdir contains 
# regex characters like . * [ ] etc)
if echo "$PATH" | grep -q "/usr/bin:.*:$installdir"

# Quoted strings will not be interpreted as globs
if [[ $PATH == */usr/bin:*:"$installdir" ]] 

We want parts of our input to be interpreted as a pattern, and parts to be literal, so neither grep nor fgrep entirely fits. Manually trying to escape the regex characters is icky at best, so we end up chancing that people won’t run the script anywhere near weirdly named files (like, say, their torrent directory). With globs, bash knows to take quoted strings literally, and doesn’t have to reparse the variable contents as part of the pattern.
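
A contrived sketch of that failure mode (the paths are made up):

installdir='/opt/app [v2]'     # install dir containing regex characters
p='/usr/bin:/bin:/opt/app 2'   # no trace of our install dir

# The bracket expression [v2] happily matches the lone "2": false positive
echo "$p" | grep -q "/usr/bin:.*:$installdir" && echo "grep says overridden"

# The quoted glob takes the brackets literally: correctly no match
[[ $p == */usr/bin:*:"$installdir" ]] || echo "glob says not overridden"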

Of course, you see how both of the above fail to account for cases like /usr/bin:$installdir. This is not something you can easily express with traditional globs, but bash does regex too, and the semantics of quoting remain the same (since 3.2 or so, at least):

# Quoted strings will not be interpreted as regex either
if [[ $PATH =~ (^|.*:)/usr/bin(:|:.*:)"$installdir"(:.*|$) ]]
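
A quick sanity check of that regex against a few made-up values:

installdir=/opt/myapp/bin
for p in "/usr/bin:$installdir" "/usr/bin:/bin:$installdir" "$installdir:/usr/bin"
do
	if [[ $p =~ (^|.*:)/usr/bin(:|:.*:)"$installdir"(:.*|$) ]]
	then echo "overridden: $p"
	else echo "ok: $p"
	fi
done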

Matching file names

I’ll skip the trivial examples for things like `ls | grep .avi$`. Here is a case where traditional globs don’t cut it:

# Copy non-BBC .avi files, and fail on half a dozen different cases
cp $(ls *.avi | grep -v BBC) /stuff
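
One of the half dozen failures, sketched with a made-up file name:

# Word splitting alone breaks it for any file name with a space
touch "Monty Python.avi"
cp $(ls *.avi | grep -v BBC) /stuff  # tries to copy "Monty" and "Python.avi"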

Bash has another form of regular expressions, extglobs (enable them with shopt -s extglob). These are mathematically regular, but don’t follow the typical Unix regex syntax:

# Copy non-BBC .avi files without making a mess 
# when files have spaces or other weird characters
cp !(*BBC*).avi /stuff

man bash contains enough on extglob, so I’d just like to point out one thing. grep -v foo can be replaced by !(foo), which strives to reject “foo” (unlike [^f][^o][^o] and similar attempts which strive to accept). egrep "foo|bar" can be replaced by @(foo|bar) to match one of the patterns. But how about grep foo | grep bar to match both?

That’s our old friend De Morgan: !(@(!(foo)|!(bar))). Don’t you just love that guy?
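
Since we’re matching substrings of file names, each part needs wildcards; a quick sketch:

# List files whose names contain both foo and bar
shopt -s extglob
ls !(@(!(*foo*)|!(*bar*)))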

PS: If you don’t already use parameter expansion to do simple trimming and replacement on variables, now could be a good time to look up that and probably save yourself a lot of sed too.
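
For example, a couple of one-liners that would otherwise each cost a sed fork:

file="video.mod"
echo "${file%.mod}.ogg"     # strip a suffix: video.ogg
echo "${file/video/music}"  # replace a substring: music.mod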

Multithreading for performance in shell scripts

Now that everyone and their grandmother have at least two cores, you can double your throughput by distributing the workload. However, multithreading support in pure shell scripts is terrible, even though you often do things that can take a while, like encoding a bunch of chip tunes to Ogg Vorbis:

mkdir ogg
for file in *.mod
do
	xmp -d wav -o - "$file" | oggenc -q 3 -o "ogg/$file.ogg" -
done

This is exactly the kind of operation that is conceptually trivial to parallelize, but not obvious to implement in a shell script. Sure, you could run them all in the background and wait for them, but that will give you a load average equal to the number of files. Not fun when there are hundreds of files.
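
That is, the brute-force version, sketched:

# Starts every encode at once: load average == number of files
for file in *.mod
do
	xmp -d wav -o - "$file" | oggenc -q 3 -o "ogg/$file.ogg" - &
done
wait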

You can run two (or however many) in the background, wait for them, and then start two more, but that’ll give terrible performance when the jobs aren’t of roughly equal length, since at the end, the longest-running job will leave the other eager cores idle.
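
The naive batching might look like this (a sketch of what not to do):

# wait stalls until the WHOLE batch is done, leaving a core idle
i=0
for file in *.mod
do
	xmp -d wav -o - "$file" | oggenc -q 3 -o "ogg/$file.ogg" - &
	(( ++i % 2 == 0 )) && wait
done
wait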

Instead of listing ways that won’t work, I’ll get to the point: GNU (and FreeBSD) xargs has a -P for specifying the number of jobs to run in parallel!

Let’s rewrite that conversion loop to parallelize it:

# Encode each argument in turn (a loop, so xargs can pass several files)
mod2ogg() {
	for arg; do xmp -d wav -o - "$arg" | oggenc -q 3 -o "ogg/$arg.ogg" -; done
}
# Export the function so the bash -c children can see it
export -f mod2ogg
# -print0/-0 keeps weird file names intact; -P 2 runs two jobs in parallel
find . -name '*.mod' -print0 | xargs -0 -n 1 -P 2 bash -c 'mod2ogg "$@"' --

And if we already had a mod2ogg script, similar to the function just defined, it would have been simpler:

find . -name '*.mod' -print0 | xargs -0 -n 1 -P 2 mod2ogg

Voila. Twice as fast, and you can just increase the -P with fancier hardware.

I also added -n 1 to xargs here, to ensure an even distribution of work. If the work units are so small that starting the command becomes a sizable portion of the total, you can increase it to make xargs run mod2ogg with more files at a time (which is why the function loops over its arguments).
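
For instance, with many small files on a multi-core box, something like this might pay off (nproc is GNU coreutils; adjust to taste):

# Bigger batches, one job per core
find . -name '*.mod' -print0 | xargs -0 -n 10 -P "$(nproc)" bash -c 'mod2ogg "$@"' --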