Buffers and windows: The mystery of ‘ssh’ and ‘while read’ in excessive detail

If you’ve ever tried to use ssh, and similarly ffmpeg or mplayer, in a while read loop, you’ll have stumbled upon a surprising interaction: the loop mysteriously aborts after the first iteration!

The solution, using ssh -n or ssh < /dev/null, is quickly spoiled by ShellCheck (shellcheck this code online), but why stop there? Let’s take a deep dive into the technical details surrounding this issue.

Note that all numbers given here are highly tool and platform specific. They apply to GNU coreutils 8.26, Linux 4.9 and OpenSSH 7.4, as found on Debian in July, 2017. If you use a different platform, or even just sufficiently newer versions, and wish to repeat the experiments, you may have to follow the reasoning and update the numbers accordingly.

Anyways, to first demonstrate the problem, here’s a while read loop that runs a command for each line in a file:

while IFS= read -r host
  echo ssh "$host" uptime
done < hostlist.txt

It works perfectly and iterates over all lines in the file:

ssh localhost uptime
ssh uptime
ssh uptime

However, if we remove the echo and actually run ssh, it will stop after the first iteration with no warnings or errors:

 16:12:41 up 21 days,  4:24, 12 users,  load average: 0.00, 0.00, 0.00

Even uptime itself works fine, but ssh localhost uptime will stop after the first one, even though it runs the same command on the same machine.

Of course, applying the aforementioned fix, ssh -n or ssh < /dev/null solves the problem and gives the expected result:

16:14:11 up 21 days,  4:24, 12 users,  load average: 0.00, 0.00, 0.00
16:14:11 up 14 days,  6:59, 15 users,  load average: 0.00, 0.00, 0.00
01:14:13 up 73 days, 13:17,  8 users,  load average: 0.08, 0.15, 0.11

If we merely wanted to fix the problem though, we'd just have followed ShellCheck's advice from the start. Let's keep digging.

You see similar behavior if you try to use ffmpeg to convert media or mplayer to play it. However, now it's even worse: not only does it stop after one iteration, it may abort in the middle of the first one!

All other commands work fine -- even other converters, players and ssh-based commands like sox, vlc and scp. Why do certain commands fail?

The root of the problem is that when you pipe or redirect to a while read loop, you're not just redirecting to read but to the entire loop body. Everything in both condition and body will share the same file descriptor for standard input. Consider this loop:

while IFS= read -r line
  cat > rest
done < file.txt

First read will successfully read a line and start the first iteration. Then cat will read from the same input source, where read left off. It reads until EOF and exits, and the loop iterates. read again tries to read from the same input, which remains at EOF. This terminates the loop. In effect, the loop only iterated once.

The question remains, though: why do our three commands in particular drain stdin?

ffmpeg and mplayer are simple cases. They both accept keyboard controls from stdin.

While ffmpeg encodes a video, you can use '+' to make the process more verbose or 'c' to input filter commands. While mplayer plays a video, you can use 'space' to pause or 'm' to mute. The commands drain stdin while processing these keystrokes.

They both also share a shortcut to quit: they will stop abruptly if any of the input they read is a "q".

But why ssh? Shouldn't it mirror the behavior of the remote command? If uptime doesn't read anything, why should ssh localhost uptime?

The Unix process model has no good way to detect when a process wants input. Instead, ssh has to preemptively read data, send it over the wire, and have sshd offer it on a pipe to the process. If the process doesn't want it, there's no way to return the data to the FD from whence it came.

We get a toy version of the same problem with cat | uptime. Output in this case is the same as when using ssh localhost uptime:

 16:25:51 up 21 days,  4:34, 12 users,  load average: 0.16, 0.03, 0.01

In this case, cat will read from stdin and write to the pipe until the pipe's buffer is full, at which time it'll block until something reads. Using strace, we can see that GNU cat from coreutils 8.26 uses a 128KiB buffer -- more than Linux's current 64KiB pipe buffer -- so one 128KiB buffer is the amount of data we can expect to lose.

This implies that the loop doesn't actually abort. It will continue if there is still data left after 128KiB has been read from it. Let's try that:

  echo first
  for ((i=0; i < 16384; i++)); do echo "garbage"; done
  echo "second"
} > file

while IFS= read -r line
  echo "Read $line"
  cat | uptime > /dev/null
done < file

Here, we write 16386 lines to the file. "first", 16384 lines of "garbage", followed by "second". "garbage" + linefeed is 8 bytes, so 16384 of them make up exactly 128KiB. The file prevents any race conditions between producer and consumer.

Here's what we get:

Read first
Read second

If we add a single line additional line of "garbage", we'll see that instead. If we write one less, "second" disappears. In other words, the expected 128KiB of data were lost between iterations.

ssh has to do the same thing, but more: it has to read input, encrypt it, and transmit it over the wire. On the other side, sshd receives it, decrypts it, and feeds it into the pipe. Both sides work asynchronously in duplex mode, and one side can shut down the channel at any time.

If we use ssh localhost uptime we're racing to see how much data we can push before sshd notifies us that the command has already exited. The faster the computer and slower the roundtrip time, the more we can write. To avoid this and ensure deterministic results, we'll use sleep 5 instead of uptime from now on.

Here's one way of measuring how much data we write:

$ tee >(wc -c >&2) < /dev/zero | { sleep 5; }

Of course, by showing how much it writes, it doesn't directly show how much sleep reads: the 65536 bytes here is the Linux pipe buffer size.

This is also not a general way to get exact measurements because it relies on buffers aligning perfectly. If nothing is reading from the pipe, you can successfully write two blocks of 32768 bytes, but only one block of 32769.

Fortunately, GNU tee currently uses a buffer size of 8192, so given 8 full reads, it will perfectly fill the 65536 byte pipe buffer. strace also reveals that ssh (OpenSSH 7.4) uses a buffer size of 16384, which is exactly 2x of tee and 1/4x of the pipe buffer, so they will all align nicely and give an accurate count.

Let's try it with ssh:

$ tee >(wc -c >&2) < /dev/zero | ssh localhost sleep 5

As discussed, we'll subtract the pipe buffer, so we can surmise that 2162688 bytes has been read by ssh. We can verify this manually with strace if we want. But why 2162688?

On the other side, sshd has to feed this data into sleep through a pipe with no readers. That's another 65536. We're now left with 2097152 bytes. How can we account for these?

This number is in fact the OpenSSH transport layer's default window size for non-interactive channels!

Here's an excerpt from channels.h in the OpenSSH source code:

/* default window/packet sizes for tcp/x11-fwd-channel */
#define CHAN_SES_PACKET_DEFAULT	(32*1024)

There it is: 64*32*1024 = 2097152.

If we adapt our previous example to use ssh anyhost sleep 5 and write "garbage"
(64*32*1024+65536)/8 = 270336 times, we can again game the buffers and cause our iterator to get exactly the lines we want:

  echo first
  for ((i=0; i < $(( (64*32*1024 + 65536) / 8)); i++)); do echo "garbage"; done
  echo "second"
} > file

while IFS= read -r line
  echo "Read $line"
  ssh localhost sleep 5
done < file

Again, this results in:

Read first
Read second

An entirely useless experiment of course, but pretty nifty!

Leave a Reply

Your email address will not be published.