Archive

Archive for April, 2012

Technically correct: Inversed regex

April 2nd, 2012

How do I write a regex that matches lines that do not contain hi?

That’s a frequently asked question if I ever saw one.

Of course, the proper answer is: you don’t. You write a regex that does match hi and then invert the matching logic, ostensibly with grep -v. But where’s the fun in that?

One interesting theorem that pops up in any book or class on formal grammars is that regular languages are closed under complement: the inverse of a regular expression is also a regular expression. This means that writing inverted regular expressions is theoretically possible, though it turns out to be quite tricky

Just try writing a regex that matches strings that does not contain “hi”, and test it against “hi”, “hhi” and “ih”, “iih” and such variations. Some solutions are coming up.

A way to cheat is using PCRE negative lookahead: ^(?!.*foo) matches all strings not containing the substring “foo”. However, lookahead assertions require a stack, and thus can’t be modelled as a finite state machine. In other words, they don’t fit the mathematical definition of a regular expression, and therefore disqualify.

There are simple, well-known algorithms for turning regular expressions into non-deterministic finite automata, and from there to deterministic FA. Less commonly used and known are algorithms for inverting a DFA and for generating familiar textual regex from it.

You can find these described in various lecture notes and slides, so I won’t recite them.

What I had a harder time finding was software that actually did this. So here is a Haskell program. It’s highly suboptimal but it does the job. When executed, it will ask for a regex and will then output a grep command that matches everything the regex does not (without -v, obviously).

The expressions it produces are quite horrific; it’s computer generated code, after all.

A regular expression for matching strings that do not match .*hi.* could be grep -E '^([^h]|h+$|h+[^hi])*$'.

This app, however, suggests grep -E '^([^h]([^h]|)*||([^h][^h]*h|h)|([^h][^h]*(h(hh*[^hi]|[^hi]))|(h(hh*[^hi]|[^hi])))((hh*[^hi]|[^h])|)*|([^h][^h]*(h([^hi][^h]*h|h))|(h([^hi][^h]*h|h)))(([^hi][^h]*h|h)|)*)$'

It still works exactly as stated though!

The app just supports a small subset of regex, just enough to convince someone that it works, and as a party trick lets you answer the original question exactly as stated.

Technically correct is the best kind of correct.

Advanced Linux-related things, Linux, Programming , ,