Vidar – Vidar's Blog

The key to understanding “Dynamic Programming” is that it’s not referring to “computer programming”

When seeing the phrase “dynamic programming” in an algorithms class or leetcode study guide, the first question people ask is “what does ‘dynamic’ mean in this context?”.

The key question is instead “what does ‘programming’ mean in this context?”, because it does not mean “computer programming”.

Instead it refers to, as the Oxford English Dictionary puts it,

programming. n.

4. Planning carried out for purposes of control, management, or administration.

So really, it’s closer to “TV programming” (as in creating a schedule), than “computer programming” (as in writing software).

The term was coined in the 1950s at a time when civil engineers would say they’re “programming a new office building” to mean they’re “planning the order and timeline for each sub-step required to construct a new office building”.

Ignoring a million details, the resulting plan (the “program”) would be something like

Site preparation
Excavation
Foundation
Structural framework
Exterior
Roofing
Interior
Hvac/electrical/plumbing
Fixtures/furnishing
Final inspection

(Real engineers can rip this example to shreds in the comments of whichever site they came from.)

The steps of the plan must necessarily be in dependency order, because each step likely relies on previous steps be done. Electricians will fail to complete their work if they arrive at the worksite while the loggers are still clearing trees.

It shouldn’t need stating, but you wouldn’t pour three separate concrete foundations if you have three steps that need a foundation. You’d do it once so that it’s ready for anything that needs it. That’s the point of ordering the steps.

Dynamic Programming in Computer Science similarly means “planning the order of each sub-step required to solve the complete problem”.

When “programming” fibonacci(10), the “program” would be something like (given that we already have fib(0) and fib(1) by definition):

fib(2)
fib(3)
fib(4)
fib(5)
fib(6)
fib(7)
fib(8)
fib(9)
fib(10)

The steps of the plan must necessarily be in dependency order, because each step likely relies on previous steps to be done. fib(7) will not be able to get its work done if fib(5) or fib(6) are not ready.

It shouldn’t need stating, but you wouldn’t compute fib(8) three separate times if you have three steps that need it. You’d do it once so that it’s ready for anything that needs it. That’s the point of ordering the steps.

This “program” could be planned top down, i.e. starting with fib(10) and determining that you first need to solve fib(9) and fib(8), and repeating the process recursively.

It could also be planned bottom up, by noticing the pattern that every fibonacci number will have to be computed in order anyways, so you might as well start from 0 and work your way up.

Either way, the final plan is the same, and both are considered Dynamic Programming.

So next time you see someone struggling to understand what could possibly make Dynamic Programming “dynamic”, suggest they instead look into what makes it “programming”.

The term “Dynamic Programming” was coined by Richard Bellman in the 1950s, long before anyone knew how confusing it would become. In his autobiography, he describes how he ended up with this name:

The 1950s were not good years for mathematical research.

We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research.

I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence.

You can imagine how he felt, then, about the term, mathematical.

The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation.

What title, what name, could I choose?

In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, “programming”.

I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying I thought, lets kill two birds with one stone. Lets take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is its impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible*.

Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.

*This Haskell fan’s contender is “dynamic typing” :P

That time I wasted weeks hand optimizing assembly because I benchmarked on random data

Once upon a time I worked in the field of Java Optimizations.

The target system was a distributed data processing platform that ran across hundreds of thousands of machines. At such scales, a 0.5% improvement would easily make up my salary going forward, and 2% was a good result for the half.

That doesn’t mean it was easy. Never have I ever seen such a highly optimized Java codebase. Not before, not since.

Every low hanging fruit had long since been picked clean. For example, there was minimal use of java.lang.String because ain’t nobody got time for a second allocation of a char[] with associated indirection and GC churn.

In other words, we were looking at increasingly heroic optimizations to justify our own existence.

Through careful, large scale profiling, I discovered that data serialization was a small but worthwhile place to focus. In particular, VarInt encoding used enough CPU time to warrant investigation.

If you’re not familiar, a VarInt, or more specifically ULEB128, is a byte efficient way to encode arbitrary integers. You use however many groups of 7 bits you need for your number, and use the 8th bit to specify whether there is another byte coming.

Numbers needing 7-bits or less (< 128) are encoded as 0nnnnnn

14-bit (<16384), 1nnnnnnn 0nnnnnnn

21bit (<2097152), 1nnnnnnn 1nnnnnnn 0nnnnnnn

And so on, in little-endian order.

For a 32bit int, you will need between 1 and 5 bytes. For a 64bit long, you will need between 1 and 10.

They are used in Google Protobuf, Apache/Facebook Thrift, WASM, DWARF, and other formats that want a compact binary representation without sacrificing range.

The interface the system used for this encoding was essentially int encodeLong(byte[] buffer, int offset, long data), returning the number of bytes written.

Conventional wisdom is that this is way too small to benefit from native code.

But conventional wisdom is for conventional development.

We were in the early phases of forking the JVM, had recently built the infrastructure to implement custom optimizations, and we were itching to use it.

Engineers don’t exactly need convincing to work on fun assembly puzzles to begin with, but the empire builders in management were additionally competing in a Manifest Destiny style land grab as the company maneuvered to take control of its tech stack from end to end. They were pushing this as hard as anyone to stake their claims.

So off I went to write an ultra optimized VarInt encoder that the JVM JIT could serialize straight into the instruction stream, bypassing the normal overhead associated with JNI.

And I sure did.

My hand coded, branchless, SIMD assembly algorithm performed 4x better than the already highly optimized Java implementation!

(The exact details escape me, but it was some combination of using BMI2 to scatter bits into groups of 7, and some AVX2 comparisons and bit extensions to compute the continuation mask and determine the resulting length.)

This was verified in a benchmark that ran both versions on billions of random numbers, confirming both the huge speedup and the bit-for-bit compatibility of the result.

I spent probably two weeks implementing, testing, integrating, polishing, and landing this change.

Finally I tried to measure the difference in prod.

Nothing.

Measuring ~1% improvements is always a challenge, but even with huge A/B runs, the difference was exactly nothing.

I did some investigations, and eventually found why my benchmarks were so misleading compared to real world performance.

It was because I was using random numbers.

As it turns out, mathematically speaking, most numbers are huge. By an infinite margin. There are dramatically many more 18 digit numbers, than there are 3 digit numbers.

In fact, if you generate random 64bit unsigned longs and try to encode them as ULEB128,

50% will need 10 bytes
~49.6% will need 9 bytes
~0.38% will need 8 bytes
~0.003% will need 7 bytes
…
~0.00000000000009% will need 2 bytes
~0.0000000000000006% will need 1 byte

Meanwhile, if I look around my real world right now, I see several dozen numbers. Almost all fit into two bytes with room to spare. A “60” watt charger, an old Covid-“19” test, page “1” of “2” of some statement. A few zip codes and reference numbers break the pattern, but overall, real numbers tend to be small.

That’s exactly why people use VarInt in the first place.

The Java version had to do more work the more bytes it had to output. By measuring on random numbers, I was inadvertently doing all my testing against the algorithm’s unrealistic worst case scenario.

When I adjusted my benchmark to primarily generate small numbers, the speedup vanished entirely.

I ended up rolling back the change, and wrote off the time spent as a proof-of-concept for developing and productionizing a custom JIT optimization.

(And I crossed “SIMD” off my programming bingo card.)

FiniteCurve.com: TSP art on a CPU budget

tl;dr: My hobby project FiniteCurve.com is a web app that draws an image with a single, long, non-intersecting line. This post explains the process.

Like many engineers, I’ve had a long standing fascination with mazes. One manifestation of this was elementary school drawings of ever-winding but never overlapping lines, like this:

They were fascinating to look at but tedious to draw. When I learned programming a few years later, I wrote many grid-based maze generators. However, while I’ve pondered the problem every several years since, I never ended up writing one that could replicate that organic, free-flowing, space filling style.

That is, until recently.

After 20 years, my pondering coincided with a blank spot on my wall, and I finally decided to implement it. Here’s Central Europe drawn with a single, ever-winding line. Click to see the entire world (10000×5000, 4MB).

The actual artwork I put on my wall — which looks better in person but doesn’t convey the concept as convincingly in a blog post — was of my guinea pig. Maze obsession aside, my favorite part is how discoverable it is:

Guinea pig drawn with a single line

It looks like a grayscale photo from a distance, but as you get closer you realize that it’s made up of something, thereby inviting you to get closer still until you have your nose pressed up against it, tracing the 160 meter line around.

I never intended to publish the program, but the result was striking enough that I compiled the C++ code to Wasm via emscripten, added a scrappy frontend, and put it online. You can play around with it on FiniteCurve.com. The source code is koalaman/finitecurve.com on GitHub.

I was not aware of this until I after I finished it, but this style closely resembles what’s known as "TSP art". Traditionally, it’s done by laying out points with a density according to a grayscale image, solving the Traveling Salesman Problem, and graphing the result.

Of course, even approximating TSP is hard, so generating such images takes minutes for smaller ones and days for large ones.

Since I only cared about the visual impression and not about the optimality or completeness of the resulting path, I just used some ad-hoc heuristics. The tool runs in seconds.

So how does it work?

The general approach is to generate a point cloud with a density matching a grayscale input image, and finding something close to a Hamiltonian path that doesn’t cross itself.

It’s a short description, but there are multiple noteworthy details:

The path does not need to be short, as it would in the Travelling Salesman Problem. Arguably, the longer the better.
The path must not cross itself. A TSP solution under triangle inequality will get this for free, but we have to work for it.
The path does not need to be complete. In fact, it shouldn’t be. Any far-away point is better left unvisited to avoid causing unsightly artifacts. Unvisited central points will merely cause small visual glitches.
It’s a path and not a cycle. Not because it’s easy, but because it’s hard. A visual cycle can trivially be created by tracing the circumference of a spanning tree in O(n) time, and where’s the fun in that?

To generate a point cloud, I simply went over each pixel in a grayscale input and added a vertex there if there wasn’t already one within some distance d determined by the shade of the pixel:

I then added edges from each node to all nearby nodes, as long as the edge did not cross an existing edge. This way, there would be no crossing edges, so any Hamiltonian path would avoid crossing itself:

I then designated an arbitrary start point and an end point. The exact points don’t matter, but for ease of locating them, I picked the left-most and right-most points, and used plain ol’ depth first search to find a path between them:

I would then iteratively expand the path. For each pair of points (p1, p2) along the path, I would find a neighbor n of p1 and try to find a path from n to p2. If found, I would expanded the path from p1->p2 to p1->n->...->p2, which is guaranteed to be longer in terms of nodes. This way, the path gradually lengthened to cover most of the nodes:

This is enough for well connected graphs.

For disjoint graphs, such as a world map, I simply added the shortest edge that would connect the disjoint regions together. This ensures that a path exists between the start and end, but necessarily leaves large island unconnected since there is only a single bridge there:

If the path at any point touched a large, unexplored section of the graph, I simply added another edge from the neighbor or neighbor’s neighor to that point. That way, any such island would slowly become reachable:

Finally, the curve is smoothed from simple, linear segments into Catmull–Rom splines, and then mapped into Cubic Bezier splines that can easily be rendered in SVG.

There is some additional polish like ignoring areas under a certain size or that are just a long, thin line of points, shuffling to try to avoid holes in the middle of a graph, and pruning edges coming out of a single vertex that are too close in angle, but really — that’s it.

I had a lot of fun hacking this project together, and I still find the result fascinating and captivating. Give it a spin on FiniteCurve.com and see what you think!

(Thanks to gnuplot for its ever invaluable plots, both for debugging, and for the illustrations in this post)

Swap on HDD: Does placement matter? (tl;dr: Yes)

Someone was partitioning an ancient laptop’s spinning rust, and asked where they should put their swap. At the start of the drive? At the end? As a swap file? Does it even matter?

While I’m on my 4th generation SSD at this point, I was interested enough to benchmark and find out. I would have used a swap file thinking it wouldn’t matter much, but boy was I wrong!

Test setup

My only remaining functional HDD — after putting two others in the electronics recycling bin — was a Western Digital 2.0TB WD20EARX SATA drive with 64MB cache.

I installed Ubuntu 21.04 Server on it, and partitioned it with 1G boot, 4G swap, 1.8T ext4 root, and another 4G swap. For easy testing, I did it all through kvm using the raw device without cache:

kvm -m 2048 -drive file=/dev/sda,format=raw,cache=none

I verified that the host did indeed not use its rather more generous RAM to cache the device.

The standard test, building Linux, did not swap as much as I hoped. With enough RAM to successfully boot, it didn’t really touch swap for as long as I bothered to watch it. I instead turned to my favorite RAM hog: building ShellCheck with the GHC Haskell compiler. This is a high RAM, low disk process.

I tried once with plenty of RAM, then each of the following with 2GB RAM and 4GB swap:

Swap at the start of the drive
Swap at the end of the drive
Swap in a file created with dd if=/dev/zero of=4g bs=1M count=4096 (which hdparm --fibmap revealed to be allocated across four contiguous areas in the first 14GB of the device)

For fun, I also included:

Swap on an old OCZ Vertex 4 SATA 256GB SSD
Swap on my current Samsung 970 Evo Plus 1TB SSD

Each test was preceded by a git clean -fdx, swapoff/swapon, dropping caches, and finally a time cabal v1-build shellcheck.

I ran each test twice and picked the lowest number, though the variance was not large. For the swap file, I generated a second one to see if it was the luck of the allocator, but results were similar. dmesg showed no read errors after the tests finished.

Results

Bar chart of numbers below

Or in numbers:

165s (2:45) — RAM only
451s (7:31) — NVMe SSD
771s (12:51) — SATA SSD
1701s (28:21) — Start of HDD
2600s (43:20) — Swap file on HDD
3161s (52:41) — End of HDD

In this test, using a swap file was surprisingly 50%+ slower than simply allocating a swap partition at the start of the drive, in spite of the low fragmentation and Linux’s bypass of the FS layer.

Drives and workloads obviously vary, but if nothing else we can safely say that yes, placement matters.

What exactly was the point of [ “x$var” = “xval” ]?

In shell scripting you sometimes come across comparisons where each value is prefixed with "x". Here are some examples from GitHub:

if [ "x${JAVA}" = "x" ]; then
if [ "x${server_ip}" = "xlocalhost" ]; then
if test x$1 = 'x--help' ; then

I’ll call this the x-hack.

For any POSIX compliant shell, the value of the x-hack is exactly zero: this comparison works without the x 100% of the time. But why was it a thing?

Online sources like this stackoverflow Q&A are a little handwavy, saying it’s an alternative to quoting (most definitely NOT the case!), pointing towards issues with "some versions" of certain shells, or generally cautioning against the mystic behaviors of especially ancient Unix system without concrete examples.

To determine whether or not ShellCheck should warn about this, and if so, what its long form rationale should be, I decided to dig into the history of Unix with the help of The Unix Heritage Society‘s archives. I was unfortunately unable to peer into the closely guarded world of the likes of HP-UX and AIX, so dinosaur herders beware.

These are the cases I found that can fail.

Left-hand side matches a unary operator

The AT&T Unix v6 shell from 1973, at least as found in PWB/UNIX from 1977, would fail to run test commands whose left-hand side matched a unary operator. This must have been immediately obvious to anyone who tried to check for command line parameters:

% arg="-f"
% test "$arg" = "-f"
syntax error: -f
% test "x$arg" = "x-f"
(true)

This was fixed in the AT&T Unix v7 Bourne shell builtin in 1979. However, test and [ were also available as separate executables, and appear to have retained a variant of the buggy behavior:

$ arg="-f"
$ [ "$arg" = "-f" ]
(false)
$ [ "x$arg" = "x-f" ]
(true)

This happened because the utility used a simple recursive descent parser without backtracking, which gave unary operators precedence over binary operators and ignored trailing arguments.

The "modern" Bourne shell behavior was copied by the Public Domain KornShell in 1988, and made part of POSIX.2 in 1992. GNU Bash 1.14 did the same thing for its builtin [, and the GNU shellutils package that provided the external test/[ binaries followed POSIX, so the early GNU/Linux distros like SLS were not affected, nor was FreeBSD 1.0.

The x-hack is effective because no unary operators can start with x.

Either side matches string length operator `-l`

A similar issue that survived longer was with the string length operator -l. Unlike the normal unary predicates, this one was only parsed as part as part of an operand to binary predicates:

var="helloworld"
[ -l "$var" -gt 8 ] && echo "String is longer than 8 chars"

It did not make it into POSIX because, as the rationale puts it, "it was undocumented in most implementations, has been removed from some implementations (including System V), and the functionality is provided by the shell", referring to [ ${#var} -gt 8 ].

It was not a problem in UNIX v7 where = took precedence, but Bash 1.14 from 1996 would parse it greedily up front:

$ var="-l"
$ [ "$var" = "-l" ]
test: -l: binary operator expected
$ [ "x$var" = "x-l" ]
(true)

It was also a problem on the right-hand side, but only in nested expressions. The -l check made sure there was a second argument, so you would need an additional expression or parentheses to trigger it:

$ [ "$1" = "-l" -o 1 -eq 1 ]
[: too many arguments
$ [ "x$1" = "x-l" -o 1 -eq 1 ]
(true)

This operator was removed in Bash 2.0 later that year, eliminating the problem.

Left-hand side is `!`

Another issue in early shells was when the left-hand side was the negation operator !:

$ var="!"
$ [ "$var" = "!" ]
test: argument expected            (UNIX v7, 1979)
test: =: unary operator expected   (bash 1.14, 1996)
(false)                            (pd-ksh88, 1988)
$ [ "x$var" = "x!" ]
(true)

Again, the x-hack is effective by preventing the ! from being recognized as a negation operator.

ksh treated this the same as [ ! "=" ], and ignored the rest of the arguments. This quiety returned false, as = is not a null string. Ksh continues to ignore trailing arguments to this day:

$ [ -e / random words/ops here ]
(true)                              (ksh93, 2021)
bash: [: too many arguments         (bash5, 2021)

Bash 2.0 and ksh93 both fixed this problem by letting = take precedence in the 3-argument case, in accordance with POSIX.

Left-hand side is "("

This is by far my favorite.

The UNIX v7 builtin failed when the left-hand side was a left-parenthesis:

$ left="(" right="("
$ [ "$left" = "$right" ]
test: argument expected
$ [ "x$left" = "x$right" ]
(true)

This happens because the ( takes precedence over the =, and becomes an invalid parenthesis group.

Why is this my favorite? Behold Dash 0.5.4 up until 2009:

$ left="(" right="("
$ [ "$left" = "$right" ]
[: 1: closing paren expected
$ [ "x$left" = "x$right" ]
(true)

That was an active bug when the StackOverflow Q&A was posted.

But wait, there’s more!

Here’s Zsh in late 2015, right before version 5.3:

% left="(" right=")"
% [ "$left" = "$right" ]
(true)
% [ "x$left" = "x$right" ]
(false)

Amazingly, the x-hack could be used to work around certain bugs all the way up until 2015, seven years after StackOverflow wrote it off as an archaic relic of the past!

The bugs are of course increasingly hard to come across. The Zsh one only triggers when comparing left-paren against right-paren, as otherwise the parser will backtrack and figure it out.

Another late holdout was Solaris, whose /bin/sh was the legacy Bourne shell as late as Solaris 10 in 2009. However, this was undoubtedly for compatibility, and not because they believed this was a viable shell. A "standards compliant" shell had been an option for a long time before Solaris 11 dragged it kicking and screaming into 21th century — or at least into the 90s — by switching to ksh93 by default in 2011.

In all cases, the x-hack is effective because it prevents the operands from being recognized as parentheses.

Conclusion

The x-hack was indeed useful and effective against several real and practical problems in multiple shells.

However, the value was mostly gone by the mid-to-late 1990s, and the few remaining issues were cleaned up before 2010 — shockingly late, but still over a decade ago.

The last one managed to stay until 2015, but only in the very specific case of comparing opening parenthesis to a closed parenthesis in one specific non-system shell.

I think it’s time to retire this idiom, and ShellCheck now offers a style suggestion by default.

Epilogue

The Dash issue of [ "(" = ")" ] was originally reported in a form that affected both Bash 3.2.48 and Dash 0.5.4 in 2008. You can still see this on macOS bash today:

$ str="-e"
$ [ \( ! "$str" \) ]
[: 1: closing paren expected     # dash
bash: [: `)' expected, found ]   # bash

POSIX fixes all these ambiguities for up to 4 parameters, ensuring that shells conditions work the same way, everywhere, all the time.

Here’s how Dash maintainer Herbert Xu put it in the fix:

/*
 * POSIX prescriptions: he who wrote this deserves the Nobel
 * peace prize.
 */

Why is /usr/bin/test 4kiB smaller than /usr/bin/[ ?

Reddit user mathisweirdaf posted this interesting observation:

 $ ls -lh /usr/bin/{test,[}
-rwxr-xr-x 1 root root 59K  Sep  5  2019 '/usr/bin/['
-rwxr-xr-x 1 root root 55K  Sep  5  2019  /usr/bin/test

[ and test are supposed to be aliases for each other, and yet there is a 4kiB difference between their GNU coreutils binaries. Why?

First, for anyone surprised: yes, there is a /usr/bin/[. I have a previous post on this subject, but here’s a quick recap:

When you write if [ -e /etc/passwd ]; then .. that bracket is not shell syntax but just a regular command with a funny name. It’s serviced by /usr/bin/[, or (more likely) a shell builtin. This explains a lot of its surprising behavior, e.g. why it’s notoriously space sensitive: [1=2] is no more valid than ls-l/tmp.

Anyways, why is there a size difference? We can compare objdump output to see where the data goes. Here’s an excerpt from objdump -h /usr/bin/[:

                 size                                          offset
15 .text         00006e82  0000000000002640  0000000000002640  00002640  2**4
16 .fini         0000000d  00000000000094c4  00000000000094c4  000094c4  2**2
17 .rodata       00001e4c  000000000000a000  000000000000a000  0000a000  2**5

and here’s objdump -h /usr/bin/test:

15 .text         000068a2  0000000000002640  0000000000002640  00002640  2**4
16 .fini         0000000d  0000000000008ee4  0000000000008ee4  00008ee4  2**2
17 .rodata       00001aec  0000000000009000  0000000000009000  00009000  2**5

We can see that the .text segment (compiled executable code) — is 1504 bytes larger, while .rodata (constant values and strings) is 864 bytes larger.

Most crucially, the increased size of the .text segment causes it to go from the 8000s to the 9000s, crossing a 0x1000 (4096) page size boundary, and therefore shifting all other segments up by 4096 bytes. This is the size difference we’re seeing.

The only nominal difference between [ and test is that [ requires a ] as a final argument. Checking for that would be a very minuscule amount of code, so what are those ~1500 bytes used for?

Since it’s hard to inspect stripped binaries, I built my own copy of coreutils and compared the list of functions in each:

$ diff -u <(nm -S --defined-only src/[ | cut -d ' ' -f 2-) <(nm -S --defined-only src/test | cut -d ' ' -f 2-)
--- /dev/fd/63      2021-02-02 20:21:35.337942508 -0800
+++ /dev/fd/62      2021-02-02 20:21:35.341942491 -0800
@@ -37,7 +37,6 @@
 D __dso_handle
 d _DYNAMIC
 D _edata
-0000000000000099 T emit_bug_reporting_address
 B _end
 0000000000000004 D exit_failure
 0000000000000008 b file_name
@@ -63,7 +62,7 @@
 0000000000000022 T locale_charset
 0000000000000014 T __lstat
 0000000000000014 t lstat
-0000000000000188 T main
+00000000000000d1 T main
 000000000000000b T make_timespec
 0000000000000004 d nslots
 0000000000000022 t one_argument
@@ -142,16 +141,10 @@
 0000000000000032 T umaxtostr
 0000000000000013 t unary_advance
 00000000000004e5 t unary_operator
-00000000000003d2 T usage
+0000000000000428 T usage
 0000000000000d2d T vasnprintf
 0000000000000013 T verror
 00000000000000ae T verror_at_line
-0000000000000008 D Version
-00000000000000ab T version_etc
-0000000000000018 T version_etc_ar
-000000000000042b T version_etc_arn
-000000000000002f R version_etc_copyright
-000000000000007a T version_etc_va
 000000000000001c r wide_null_string.2840
 0000000000000078 T x2nrealloc
 000000000000000e T x2realloc

The major contributors are the version_etc* functions. What do they do?

Well, let’s have a look:

/* The three functions below display the --version information the
   standard way. [...]

These are 260 lines of rather elaborate, internationalized, conditional ways of formatting data that makes up --version output. Together they take about bc <<< "ibase=16; 7A+2F+42B+18+AB+8+99" = 1592 bytes.

What does this mean? Simple. This is what we’re paying an extra 4kb for:

$ /usr/bin/[ --version
[ (GNU coreutils) 8.30
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Kevin Braunsdorf and Matthew Bradburn.

[ --version is missing the final ], so the invocation is invalid and the result is therefore implementation defined. GNU is free to let it output version info.

Meanwhile, /usr/bin/test --version is a valid invocation, and POSIX mandates that it simply returns success when the first parameter (--version) is a non-empty string.

This difference is even mentioned in the documentation:

NOTE: [ honors the --help and --version options, but test does not.
test treats each of those as it treats any other nonempty STRING.

Mystery solved!

(Exercise: what would have been the implications of having test support --help and --version in spite of POSIX?)