Regex for strings with at least three unique characters

1.7k Views Asked by At

I'm trying to represent some string conditions in terms of regex. One of those conditions I find hard to transform is that the string must have at least three different characters. So is there any plausible regex for that condition? Suppose the alphabet only contains English characters (a-z A-Z).

Thank you,

2

There are 2 best solutions below

0
On

If your alphabet is $\{a, b, c\}$ only, and you write $A$ for $(a \mid b \mid c)$, then something like $$A^* ( a A^* b A^* c \mid a A^* c A^* b \mid b A^* a A^* c \mid b A^* c A^* a \mid c A^* a A^* b \mid c A^* b A^* a ) A^* $$ will do. If your alphabet had $n$ symbols, this would mean 6 cases for each of the $\binom{n}{3}$ ways of selecting three different symbols.

0
On

If you don't want your regular expression to be that long, you can describe this language as the complement of another one. You can put $L=\overline{L'}$, where $L$ is the requested language and then describe $L'$ with a regular expression, which in your case is $$(a+b)^*+(a+c)^*+(b+c)^*$$ provided the alphabet is $\{a,b,c\}$. This regex describes all words that contain at most two different characters, which is the complementary language of the one that contains words with at least three distinct characters.

This regex has length $\sim n^2$, but it's not a regex for the actual language, rather than its complement. For an alphabet $A=\{a_1,\ldots,a_n\}$ containing $n$ letters, the regex is $$\sum_{i\neq j}(a_i+a_j)^*$$ where the sum is actually for repeated application of "$+$" on the terms $(a_i+a_j)^*$.