{"id":324,"date":"2013-09-30T01:00:16","date_gmt":"2013-09-30T01:00:16","guid":{"rendered":"http:\/\/www.vidarholen.net\/contents\/blog\/?p=324"},"modified":"2014-04-16T16:33:11","modified_gmt":"2014-04-16T16:33:11","slug":"adventures-in-string-reversal","status":"publish","type":"post","link":"https:\/\/www.vidarholen.net\/contents\/blog\/?p=324","title":{"rendered":"Adventures in String Reversal"},"content":{"rendered":"<p>Oh, string reversal! The bread and butter of Programming 101 exams. If I ask you to prove your hacker worth by implementing it in your favorite language, how long would it take you and how many tries will you need to get it right? <\/p>\n<p>Five minutes with one or two tries? 30 seconds and nail it on the first try?<\/p>\n<p>What if I say that this is 2013 and your software can&#8217;t just fail because a user inputs non-ASCII data?<\/p>\n<p>Well&#8230; Java, C#, Python, Haskell and all other modern languages have native Unicode string types, so at most you&#8217;ll just need another minute to verify that it does indeed work, right?<\/p>\n<p>No, you will in fact need several hours and hundreds of lines of code. Reversing a string is much harder than one would think. <\/p>\n<p>The following are cases that a string reversal algorithm could reasonably be expected to handle, but which your initial, naive implementation most likely fails:<\/p>\n<h5>Byte order marks<\/h5>\n<p>Wikipedia says that &#8220;The <a href=\"http:\/\/en.wikipedia.org\/wiki\/Byte_order_mark\">byte order mark<\/a> (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. It is encoded at U+FEFF byte order mark (BOM). BOM use is optional, and, if used, should appear at the start of the text stream.&#8221; <\/p>\n<p>It&#8217;s obviously a bug if the BOM ends up at the end of the string when it&#8217;s reversed. At least that&#8217;s a simple fix, right? <\/p>\n<h5>Surrogate pairs<\/h5>\n<p>Environment based around 16-bit character types, like Java and C#&#8217;s <code>char<\/code> and some C\/++ compilers&#8217; <code>wchar_t<\/code>, had an awkward time when Unicode 2.0 came along, which expanded the number of characters from 65536 to 1114112. Characters in so-called supplementary planes will not fit in a 16-bit <code>char<\/code>, and will be encoded as a surrogate pair &ndash; two <code>char<\/code>s next to each other.<\/p>\n<p>If two chars form a single code point (see e.g. Java&#8217;s <a href=\"http:\/\/docs.oracle.com\/javase\/1.5.0\/docs\/api\/java\/lang\/String.html#codePointAt(int)\"><code>String.codePointAt(int)<\/code><\/a>), reversing them produces an invalid character. <\/p>\n<p>Trashing characters in the string is not a property of correct string reversers. Please fix.<\/p>\n<h5>Composing characters<\/h5>\n<p>While there is a separate character for &#8220;\u00f1&#8221;, n with tilde, it can also be written as two characters: regular &#8220;n&#8221; (U+006E) plus composing tilde (U+0303), which I&#8217;ll write as a regular tilde for illustration.<\/p>\n<p>In this way, you can encode &#8220;pin~a colada&#8221;, and it will render as &#8220;pi\u00f1a colada&#8221;. If the string is trivially reversed, it becomes &#8220;adaloc a~nip&#8221; which will render as &#8220;adaloc \u00e3nip&#8221;. The tilde is now on the wrong character.<\/p>\n<p>Please don&#8217;t shuffle diacritical marks in the input string. Just reverse it.<\/p>\n<p>By the way, if you try to fix this by ensuring that composing characters stay behind their preceding character, you&#8217;ll introduce a regression. <i>Double<\/i> composing characters go <i>between<\/i> the characters they compose. <\/p>\n<p>To put a &#8216;double macron below&#8217; under the characters &#8220;ea&#8221; in &#8220;mean&#8221;, you&#8217;d encode &#8220;me_an&#8221; which renders as &#8220;m<u>ea<\/u>n&#8221;. If you try to reverse it while keeping the macron after the &#8220;e&#8221;, you end up with &#8220;nae_m&#8221; (&#8220;na<u>em<\/u>&#8220;) rather than the original, correct &#8220;na_em&#8221; (&#8220;n<u>ae<\/u>m&#8221;).<\/p>\n<h5>Directional overrides<\/h5>\n<p>What&#8217;s &#8220;hello world&#8221; backwards? It&#8217;s &#8220;hello world&#8221; if your implementation is to be believed. <\/p>\n<p>It happens to be encoded with left-to-right and right-to-left overrides as &#8220;U+202D U+202E dlrow olleh&#8221;. <\/p>\n<p>In this direction, everything from the second character onward is shown right-to-left as &#8220;hello world&#8221;. With trivial reversion, it becomes &#8220;hello world&#8221; followed by a RLO immediately cancelled by a LRO.<\/p>\n<p>Your string reverser doesn&#8217;t actually reverse strings. Would you kindly sort that out?<\/p>\n<p>Obviously, it also has to handle explicit directional embedding, U+202A and U+202B, which are similar but not identical to directional overrides.<\/p>\n<h5>RTL scripts<\/h5>\n<p>Reversal issues occur naturally in bidirectional text. A mix such as &#8220;hello \u05e2\u05d5\u05dc\u05dd&#8221; will render &#8220;hello&#8221; LTR and &#8220;\u05e2\u05d5\u05dc\u05dd&#8221; RTL (the &#8220;\u05dd&#8221; is encoded last, but displays leftmost in that word). When the latin script is first, the string starts from the left margin, with the first encoded character to the left.<\/p>\n<p>If we trivially reverse this string, we get &#8220;olleh \u05dd\u05dc\u05d5\u05e2&#8221; as it starts rendering from the right margin. The first encoded character appears rightmost in the right word, while the last encoded displays rightmost of the leftmost word, i.e. in the middle. <\/p>\n<p>Obviously, &#8220;hello \u05e2\u05d5\u05dc\u05dd&#8221; backwards is &#8220;\u05dd\u05dc\u05d5\u05e2 olleh&#8221;. Please add this to your list. <\/p>\n<h5>Left-to-right and right-to-left markers<\/h5>\n<p>Similarly to the two above cases, the LRM (U+200E) and RLM (U+200F) codes allows changing the direction neutral characters (such as punctuation) are rendered.<\/p>\n<p>&#8220;(RLM)O:&#8221; will be rendered as &#8220;:O&#8221; in the right margin. With trivial string reversal, it will still render as &#8220;:O&#8221;, starting at the left margin.<\/p>\n<p>Didn&#8217;t we already file two bugs about this? <\/p>\n<h5>Pop directional formatting<\/h5>\n<p>Once your kludged and fragile directional string reversal support appears to work reasonably ok, along comes the U+202C Pop Directional Format character. It never ends!<\/p>\n<p>This character undoes the previous explicit directional override, whatever it happened to be. You can no longer try to be clever by splitting the string up into linear sections based on directional markers; you have to go full stack parsing. <\/p>\n<p>Here&#8217;s the <a href=\"http:\/\/www.unicode.org\/reports\/tr9\/\">ten thousand word specification<\/a> of the Unicode directionality algorithm. Have fun.<\/p>\n<h5>Interlinear annotations<\/h5>\n<p>Even if you give up and add a TODO to handle directionality, you still have some cases to go. In logographic languages like Chinese and Japanese, you can have pronunciation guides, so called <a href=\"http:\/\/en.wikipedia.org\/wiki\/Ruby_character\">ruby characters<\/a>, alongside the text. <\/p>\n<p>If your browser supports it, here&#8217;s an example: <ruby>\u6f22<rt>kan<\/rt>\u5b57<rt>ji<\/rt><\/ruby>. <\/p>\n<p>To support this in plain text, Unicode has U+FFF9 through U+FFFB, the Interlinear Annotation Anchor, Separator and Terminator characters respectively. The above could be encoded as &#8220;U+FFFF9 \u6f22\u5b57 U+FFFA kan U+FFFA ji U+FFFB&#8221;.<\/p>\n<p>Reversing the anchor and terminating characters is obviously a bug. <\/p>\n<p>Your string reverser produces garbled output instead of a reversed string&#8230; Is it going to be much longer?<\/p>\n<p>Note that reversing just the contents is still wrong. Instead of correctly annotating &#8220;\u5b57\u6f22&#8221; with &#8220;ij nak&#8221;, you&#8217;d be annotating &#8220;ij&#8221; with &#8220;nak \u5b57\u6f22&#8221;. <\/p>\n<p>Once you&#8217;ve correctly handled this case, try it again when you have an excess of separators at the end of the ruby text. Normally, these would just be ignored, but if you reversed them and put them in front, they&#8217;ll push all ruby characters away from where they were supposed to be.<\/p>\n<p>For &#8220;U+FFFF9 \u6f22\u5b57 U+FFFA kan U+FFFA ji U+FFFA U+FFFB&#8221;, instead of <ruby>\u5b57<rt>ij<\/rt>\u6f22<rt>nak<\/rt><\/ruby> you&#8217;d get <ruby>\u5b57\u6f22<rt>ij<\/rt><rt>nak<\/rt><\/ruby>.<\/p>\n<p>(<i>Update:<\/i> Commenter Jim convincingly argues that you&#8217;d want to reverse the ruby logograph groups but not the characters themselves, resulting in <ruby>\u5b57<rt>ji<\/rt>\u6f22<rt>kan<\/rt><\/ruby> )<\/p>\n<p>Like with the composing characters, your string reversal shuffles ruby characters around. Please&#8230; oh, why bother. <\/p>\n<h5>Conclusion<\/h5>\n<p>Your implementation most likely had half a dozen bugs. Maybe string reversal is beyond your abilities? Join the club!<\/p>\n<p>Hopefully you had fun anyways.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Oh, string reversal! The bread and butter of Programming 101 exams. If I ask you to prove your hacker worth by implementing it in your favorite language, how long would it take you and how many tries will you need to get it right? Five minutes with one or two tries? 30 seconds and nail &hellip; <a href=\"https:\/\/www.vidarholen.net\/contents\/blog\/?p=324\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Adventures in String Reversal&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[23],"tags":[47],"class_list":["post-324","post","type-post","status-publish","format-standard","hentry","category-programming","tag-unicode"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/posts\/324","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=324"}],"version-history":[{"count":35,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/posts\/324\/revisions"}],"predecessor-version":[{"id":361,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/posts\/324\/revisions\/361"}],"wp:attachment":[{"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=324"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=324"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=324"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}