{"id":36,"date":"2012-04-02T10:47:04","date_gmt":"2012-04-02T10:47:04","guid":{"rendered":"http:\/\/www.vidarholen.net\/contents\/blog\/?p=36"},"modified":"2012-04-02T10:58:14","modified_gmt":"2012-04-02T10:58:14","slug":"technically-correct-inversed-regex","status":"publish","type":"post","link":"https:\/\/www.vidarholen.net\/contents\/blog\/?p=36","title":{"rendered":"Technically correct: Inversed regex"},"content":{"rendered":"<blockquote><p>How do I write a regex that matches lines that do <strong>not<\/strong> contain <code>hi<\/code>?<\/p><\/blockquote>\n<p>That&#8217;s a frequently asked question if I ever saw one.<\/p>\n<p>Of course, the proper answer is: you don&#8217;t. You write a regex that <strong>does<\/strong> match <code>hi<\/code> and then invert the matching logic, ostensibly with <code>grep -v<\/code>. But where&#8217;s the fun in that?<\/p>\n<p>One interesting theorem that pops up in any book or class on formal grammars is that regular languages are closed under complement: the inverse of a regular expression is also a regular expression. This means that writing inverted regular expressions is theoretically possible, though it turns out to be quite tricky <\/p>\n<p>Just try writing a regex that matches strings that does not contain &#8220;hi&#8221;, and test it against &#8220;hi&#8221;, &#8220;hhi&#8221; and &#8220;ih&#8221;, &#8220;iih&#8221; and such variations. Some solutions are coming up. <\/p>\n<p>A way to cheat is using PCRE negative lookahead: <code>^(?!.*foo)<\/code> matches all strings not containing the substring &#8220;foo&#8221;. However, lookahead assertions require a stack, and thus can&#8217;t be modelled as a finite state machine. In other words, they don&#8217;t fit the mathematical definition of a regular expression, and therefore disqualify.<\/p>\n<p>There are simple, well-known algorithms for turning regular expressions into non-deterministic finite automata, and from there to deterministic FA. Less commonly used and known are algorithms for inverting a DFA and for generating familiar textual regex from it. <\/p>\n<p>You can find these described in <a href=\"http:\/\/www.cs.ucf.edu\/courses\/cop3402\/sum2010\/R3%20Regular%20Expressions%20and%20DFAs.ppt\">various<\/a> lecture <a href=\"http:\/\/www.cs.rochester.edu\/u\/nelson\/courses\/csc_173\/fa\/re.html\">notes and slides<\/a>, so I won&#8217;t recite them. <\/p>\n<p>What I had a harder time finding was software that actually did this. So <a href=\"\/contents\/junk\/files\/invert.hs\">here<\/a> is a Haskell program. It&#8217;s highly suboptimal but it does the job. When executed, it will ask for a regex and will then output a grep command that matches everything the regex does not (without <code>-v<\/code>, obviously). <\/p>\n<p>The expressions it produces are quite horrific; it&#8217;s computer generated code, after all. <\/p>\n<p>A regular expression for matching strings that do not match <code>.*hi.*<\/code> could be <code>grep -E '^([^h]|h+$|h+[^hi])*$'<\/code>.<\/p>\n<p>This app, however, suggests <code>grep -E '^([^h]([^h]|)*||([^h][^h]*h|h)|([^h][^h]*(h(hh*[^hi]|[^hi]))|(h(hh*[^hi]|[^hi])))((hh*[^hi]|[^h])|)*|([^h][^h]*(h([^hi][^h]*h|h))|(h([^hi][^h]*h|h)))(([^hi][^h]*h|h)|)*)$'<\/code><\/p>\n<p>It still works exactly as stated though!<\/p>\n<p>The app just supports a small subset of regex, just enough to convince someone that it works, and as a party trick lets you answer the original question exactly as stated. <\/p>\n<p>Technically correct is the best kind of correct. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>How do I write a regex that matches lines that do not contain hi? That&#8217;s a frequently asked question if I ever saw one. Of course, the proper answer is: you don&#8217;t. You write a regex that does match hi and then invert the matching logic, ostensibly with grep -v. But where&#8217;s the fun in &hellip; <a href=\"https:\/\/www.vidarholen.net\/contents\/blog\/?p=36\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Technically correct: Inversed regex&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[5,4,23],"tags":[42,41,40],"class_list":["post-36","post","type-post","status-publish","format-standard","hentry","category-advanced-linux","category-linux","category-programming","tag-compsci","tag-regex","tag-technically-correct"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/posts\/36","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=36"}],"version-history":[{"count":0,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=\/wp\/v2\/posts\/36\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=36"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=36"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.vidarholen.net\/contents\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=36"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}