Check if a regex is ambiguous

2.1k Views Asked by At

I wonder if there is a way to check the ambiguity of a regular expression automatically. A regex is considered ambiguous if there is an string which can be matched by more that one ways from the regex. For example, given a regex $R = (ab)^*(a\mid b)^*$, we can detect that $R$ is an ambiguous regex since there are two ways to match string $ab$ from $R$.

This questions is posted there initially, but I think Mathematics StackExchange is a more suitable place.

Thanks,

1

There are 1 best solutions below

2
On BEST ANSWER

This solution is due to @prateek.

We will inductively decide whether a given regex $\alpha$ is ambiguous.

If $\alpha = \epsilon$, $\alpha = a \in \Sigma$, or $\alpha = \varnothing$, it is unambiguous.

If $\alpha = \beta+\gamma$, then $L(\alpha) = L(\beta) \cup L(\gamma)$, so, $\alpha$ is unambiguous if and only if both $\beta$ and $\gamma$ are unambiguous, and $L(\alpha) \cap L(\beta) \ne \varnothing$. By induction, the former is decidable, and the second part is just checking if some regular language is non-empty, which is again decidable.

If $\alpha = \beta\cdot \gamma$, then again, for $\alpha$ to be unambiguous, it is necessary and sufficient that $\beta$ and $\gamma$ are unambiguous, and that every element of $L(\beta)\cdot L(\gamma)$ factors uniquely. There are other ways to check the last condition, but we can again reduce this to emptyness of a regular language. Consider the following NFA over $\Sigma\times \Sigma \times \{0,1\}$.

We run two copies of the NFA of $\alpha$ on the two factors, and keep track (in the third factor) if we switch from $\beta$ to $\gamma$ at different places. This will have an accepting run on a word iff there are multiple ways to factor the word. Thus, the language is empty iff there is no word with multiple factorizations.

A similar argument works for $\alpha = (\beta)^*$. One needs to be a bit careful depending on whether $\epsilon \in L(\beta)$ and whether $\alpha$ is considered ambiguous in that case.