I want to find a regular expression that defines a language L over the alphabet {0,1} with the following condition: every word contains exactly two 000 substrings. For example, 1010001010001111 is a valid word, but 101000011 is not (even though its 0000 block contains two overlapping occurrences of 000).
Now, the language of words over the {0,1} alphabet that contain exactly one 000 substring is, if I'm not mistaken, (1 + 01 + 001)* 000 (1 + 10 + 100)*; call this expression X.
Would the solution to the first problem be X 1 X? Or am I missing something? Any constructive input would be greatly appreciated.
First I try a definition of what it means for a word to have $k$ substrings $s$:
Then I try to specify the language with no substring $s=0^3$: $$ L_0 = (1|01|001)^*(\epsilon|0|00)(1|10|100)^* $$
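As a sanity check on $L_0$, here is a minimal Ruby sketch that compares the expression with a direct substring test on every word up to length 10. The names `L0` and `each_binary_word` are made up for this illustration, and the optional group `(0|00)?` stands in for $(\epsilon|0|00)$.

```ruby
# Sketch only: L0 and each_binary_word are hypothetical names.
# (0|00)? plays the role of (epsilon|0|00) in the expression for L_0.
L0 = /\A(1|01|001)*(0|00)?(1|10|100)*\z/

# Yield every word over {0,1} of length 0..max_len.
def each_binary_word(max_len)
  (0..max_len).each do |n|
    (0...(1 << n)).each do |i|
      # The empty word needs special care, since 0.to_s(2) is "0".
      yield(n.zero? ? "" : i.to_s(2).rjust(n, "0"))
    end
  end
end

# Collect every word on which the regexp and the direct test disagree.
mismatches = []
each_binary_word(10) do |w|
  expected = !w.include?("000")
  mismatches << w if L0.match?(w) != expected
end

if mismatches.empty?
  puts "L0 agrees with the direct test on all words up to length 10"
else
  puts "mismatches: #{mismatches.first(5).inspect}"
end
```

An exhaustive check on short words is not a proof either, but it catches slips such as a missing alternative quickly.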
Now I try to add the substring at all places of $L_0$ and come up with a combined expression:
$$ L_1 = (1|01|001)^*(0^3|0^4|0^5)(1|10|100)^* $$ contains all words which contain exactly one substring $0^3$.
Finally I try to add another $s$ at every place: \begin{align} L_2 = & ((1|01|001)^*(0^6|0^7|0^8)(1|10|100)^*)|\\ & ((1|01|001)^*(0^3|0^4|0^5)(1|10|100)^*1(0^3|0^4|0^5)(1|10|100)^*) \end{align} contains all words which contain exactly two substrings $0^3$.
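To see how the pieces fit, the two example words from the question can be split along these expressions. This is only an illustration; the dots mark the boundaries between the individual alternatives chosen inside each starred group.

$$ 1010001010001111 = \underbrace{1\cdot 01}_{(1|01|001)^*}\;\underbrace{000}_{0^3}\;\underbrace{10}_{(1|10|100)^*}\;\underbrace{1}_{1}\;\underbrace{000}_{0^3}\;\underbrace{1\cdot 1\cdot 1\cdot 1}_{(1|10|100)^*} \in L_2 $$

$$ 101000011 = \underbrace{1\cdot 01}_{(1|01|001)^*}\;\underbrace{0000}_{0^4}\;\underbrace{1\cdot 1}_{(1|10|100)^*} \in L_1 $$

So the second word falls into $L_1$ rather than $L_2$, which matches the question's statement that it should not count as a word with two substrings.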
Testing:
I wrote some Ruby code (link) to test the above regular expressions.
It generates random strings and checks if the above regular expressions give the same results as some alternative test methods.
This is not a proof, but it gives some confidence that the results hold.
Try it: if you change small parts of the regexps, you will notice the differences.
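A minimal sketch of such a test (not the linked script itself) could look as follows. All names (`PREFIX`, `SUFFIX`, `L1`, `L2`, `L1_BROKEN`, `count_000`, `random_word`) are made up for this illustration. It assumes that occurrences are counted without overlap, which is what the exponents 3 to 5 and 6 to 8 in the expressions suggest, and it includes one deliberately broken variant of $L_1$ to show how a small change is detected.

```ruby
# Sketch only: every name below is hypothetical, not from the linked script.
PREFIX = "(?:1|01|001)*"
SUFFIX = "(?:1|10|100)*"
L1 = /\A#{PREFIX}0{3,5}#{SUFFIX}\z/
L2 = /\A(?:#{PREFIX}0{6,8}#{SUFFIX}|#{PREFIX}0{3,5}#{SUFFIX}10{3,5}#{SUFFIX})\z/
# A deliberately broken L1: the alternative 0^5 is dropped.
L1_BROKEN = /\A#{PREFIX}0{3,4}#{SUFFIX}\z/

# Assumption: occurrences are counted without overlap, so 0000 counts
# once and 000000 counts twice. String#scan returns successive
# non-overlapping matches, which gives exactly this count.
def count_000(word)
  word.scan("000").length
end

# Random word over {0,1} with length between 0 and max_len.
def random_word(rng, max_len)
  Array.new(rng.rand(0..max_len)) { rng.rand(2) }.join
end

rng = Random.new(42)
# A few fixed words (including the two from the question) plus random ones.
words = %w[1010001010001111 101000011 00000 000000]
words += Array.new(20_000) { random_word(rng, 20) }

# Keep only the words on which a regexp and the counter disagree.
l1_errors = words.reject { |w| L1.match?(w) == (count_000(w) == 1) }
l2_errors = words.reject { |w| L2.match?(w) == (count_000(w) == 2) }
broken_errors = words.reject { |w| L1_BROKEN.match?(w) == (count_000(w) == 1) }

puts "L1 mismatches: #{l1_errors.length}"
puts "L2 mismatches: #{l2_errors.length}"
puts "broken L1 mismatches: #{broken_errors.length}, e.g. #{broken_errors.first.inspect}"
```

The broken variant rejects words such as 00000, which contain exactly one non-overlapping $0^3$, so it shows up as a mismatch while the original expressions do not.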