In regular expressions for regular languages, is there distinction between eager and lazy matching?

72 Views Asked by At

In regex, matching is eager by default, and lazy by ?. For example *abc and *?abc.

In regular expressions for regular languages, is there distinction between eager and lazy matching? I have two conflict guesses:

  • I guess no, and is the distinction an implementation detail of FAs which recognize the regular languages denoted by the regular expressions?

  • I guess that the definition of Kleene star operator could be eager.

Thanks.

1

There are 1 best solutions below

0
On

In the formal language theoretical model, there is no concept of greediness of repetition. A sentence either matches or does not match, and the DFA does not involve any retention of information about how the sentence matched; that is, which subsequences of the sentence were matched by which subexpressions of the regular expression. (This is not an artefact of the DFA construction. The NFA construction is also lacking in structural information, and remember that the N in NFA is an abbreviation for Non-deterministic.)

You could convert the matching automaton into a regular (rational) grammar, and record the derivation used to parse the sentence. From the derivation you can derive a parse tree. But then grammar is likely to be ambiguous, grammars, like NFAs, are essentially non-deterministic. (Indeed, you can consider a rational grammar to just be another way to write an NFA; the two formalisms are homomorphic.)

Practical regex libraries are, however, used to decompose inputs, not just to recognise whether an input matches. The decomposition requires some kind of disambiguation (not necessarily complete), and that's provided by markings inside "regexes", notably "captures" and "greediness". (And other mechanisms as well, in some libraries.) These mechanisms cannot be transformed into Type 3 or even Type 2 grammars. In some regex libraries, they may even interfere with the recognition of sentences, so that not all matching sentences are actually matched.

There is some literature which attempts to formalise this process, but it's more or less on the margins of formal language theory (although there are probably lots of problems which would make for interesting studies). What seems clear is that linear-time parsing is not generally possible. Moreover, if the non-deterministic alternation operator is replaced with an ordered-choice operator, then some non-context-free languages can be recognised (possibly requiring non-polynomial time).

Greediness markings by themselves, if they don't restrict the underlying non-determinism, are not as powerful as ordered-choice but they do impede linear-time parsing (and I suspect they make it impossible).

In short, regular (rational) expressions are not equivalent to the pattern languages of common regex libraries. They do not serve the same purpose; they do not respond to the same algorithms and they do not have the same computational complexity. In a very real sense, they belong to different domains, and one should avoid the temptation to be misled by the similarity of names.