Regular expressions match not "abc"

219 Views Asked by At

I need to match any latin word NOT containing "abc" using basic regex operations (without syntactic sugar or look-ahead like (?!...)): only +, *, |. Solution for "ab" would be quite trivial, but adding 'c' made me lost in thoughts. Appreciate any help.

1

There are 1 best solutions below

0
On

This is one (of many) cases where it is easier to start with the DFA and create the regular expression afterwards. DFAs are sometimes radically more compact than regular expressions; in this case, you only need one state per position in the string to be excluded.

In cases where the string to be excluded has the property that the first character does not appear again in the string, the solution is fairly straight-forward. The DFA can be drawn by following the following pattern:

Three-state DFA

The regular expression can then be recovered using any standard algorithm for creating regular expressions from state machines.

Note that this regular expression is essentially the same as the epsilon-closure of the machine implicitly built by the Knuth-Morris-Pratt search algorithm with the difference is that KMP is seeking to find the string, whereas here we are seeking to fail if the string is found. (Consequently, all of the states in the machine above are accepting states. I didn't draw the failure state, which would result from a c transition from state 3; that would be the KMP success state.)

So if there is a suffix of the string you need to exclude which overlaps a prefix, you can use the KMP algorithm; otherwise, it's just a simple linear DFA where transitions move left-to-right along the target string and a match of the first character from any state leads to state 2 while any other non-match leads to state 1.

The regex conversion is tedious, and I might have gotten it wrong. I find the easiest approach is usually to delete states one at a time, collapsing transitions through the deleted state into regex transitions. (See this http://cs.stackexchange answer for a better explanation. If I did it right, one regex for the above state diagram is

([b-z]|a(b?a)*([c-z]|b[bd-z]))*((b?a)+b?)?