How to break down a problem while constructing a CFG for a language?


A problem I came across was:

Design a CFG for the language $\{a^ib^jc^k\,|\,i=j+k \}$

The solution I came up with :

$S\rightarrow aSc\,|\,S_1$
$S_1\rightarrow aS_1b\,|\,\epsilon$

It took me more than an hour to come up with this solution, and there are many more problems to solve.
What is the best way to tackle such problems?
How would you break this one down into a word problem?

EDIT: How I broke it down was :

  • the $a$'s are followed by the $b$'s, and the $b$'s by the $c$'s
  • the number of $a$'s equals the number of $b$'s when $k = 0$; otherwise ($j, k > 0$) there are more $a$'s than $b$'s and more $a$'s than $c$'s

There are 2 answers below

BEST ANSWER

Here I would notice that if $b$ and $c$ were the same symbol, this language would be just $\{a^kb^k:k\ge 0\}$, which is generated by $S\to aSb\mid\epsilon$. However, $b$ and $c$ are not the same, so somehow instead of generating $b$s to the right of the $S$, I need to generate a mixture of $b$s and $c$s. If I didn’t care about their order, I could replace $S\to aSb$ by $S\to aSX$ and add the productions $X\to b\mid c$.

Unfortunately, I do care about their order: I have to generate $c$s for a while, and then $b$s. That suggests replacing $S$ with two new non-terminals, say $C$ generating $c$s and $B$ generating $b$s, with a switch from $C$ to $B$ at some point:

$$\begin{align*} &S\to aCc\mid aBb\mid\epsilon\\ &C\to aCc\mid aBb\mid\epsilon\\ &B\to aBb\mid\epsilon\;. \end{align*}$$

Now I realize that I can simplify this a bit: $S$ and $C$ have the same productions and might as well be the same non-terminal.

$$\begin{align*} &S\to aSc\mid aBb\mid\epsilon\\ &B\to aBb\mid\epsilon\;. \end{align*}$$
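As a quick sanity check (my own sketch, not part of the answer), one can enumerate every string this grammar derives up to a bounded length and compare the result with the target set $\{a^{j+k}b^jc^k : j,k\ge 0\}$, e.g. in Python:

```python
RULES = {
    "S": ["aSc", "aBb", ""],   # S -> aSc | aBb | eps
    "B": ["aBb", ""],          # B -> aBb | eps
}

def derive(max_len):
    """Exhaustively rewrite sentential forms (leftmost nonterminal first),
    collecting every purely-terminal string of length <= max_len."""
    seen, terminal_strings = {"S"}, set()
    frontier = ["S"]
    while frontier:
        form = frontier.pop()
        nt = next((c for c in form if c in RULES), None)
        if nt is None:
            terminal_strings.add(form)
            continue
        i = form.index(nt)
        for rhs in RULES[nt]:
            new = form[:i] + rhs + form[i + 1:]
            # terminals are never erased, so prune forms already too long
            if sum(c not in RULES for c in new) <= max_len and new not in seen:
                seen.add(new)
                frontier.append(new)
    return terminal_strings

# the target language {a^(j+k) b^j c^k}, restricted to length <= 8
target = {"a" * (j + k) + "b" * j + "c" * k
          for j in range(5) for k in range(5) if 2 * (j + k) <= 8}
assert derive(8) == target
```

The search terminates because every right-hand side contains at most one nonterminal and the terminal count is capped, so only finitely many sentential forms arise.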


I don't think there is a general approach. YMMV, but three things that help me a lot are:

  • You can think of CFLs as complicated multi-parenthesis languages. Finding this left-paren-to-right-paren correspondence gives a sense of the structure of the language.

    In your case $\mathtt{a}$ plays the role of $(_c$ and $(_b$, while $\mathtt{b}$ and $\mathtt{c}$ are $)_b$ and $)_c$.
    To put it differently, your language is a language of expressions looking like this: $$(_c(_c\cdots(_c\quad(_b(_b\cdots(_b \quad\bullet\quad)_b)_b\cdots)_b\quad)_c)_c\cdots)_c.$$

    Now it is obvious that you need a two-step approach, \begin{align}S_1 &\to (_c\ S_1\ )_c \mid S_2\\S_2 &\to (_b\ S_2\ )_b \mid \bullet\end{align}

  • The second approach: CFLs are recognized by (non-deterministic) one-stack automata. Moreover, in many cases these degenerate into one-counter automata (i.e. automata with a stack where it doesn't matter which letters are pushed or popped). So you try to imagine going over the word while keeping one counter: what value will be necessary to properly recognize the rest of the word?

    To parse a word from your language you need to know the number of letters $\mathtt{b}$ and $\mathtt{c}$ necessary to balance $\mathtt{a}$'s. In other words, while reading $\mathtt{a}$'s you increase the counter, and then decrease it with each $\mathtt{b}$ and later $\mathtt{c}$'s.

    Then, to construct a grammar, you need to somehow transfer that knowledge about incrementing the counter from the left side of the word to the right side. You won't be able to do it if these parts are generated independently, as in $S \to AB$ (here $A$ and $B$ are two unrelated non-terminals that may correspond to anything), so you need to connect them somehow, e.g. $S \to L\ S\ R \mid \ldots$, and then make sure that $L$ corresponds to increasing the counter and $R$ to decreasing it, say $L \to \mathtt{a}$ and $R \to \mathtt{b} \mid \mathtt{c}$. Then you observe that the $\mathtt{b}$'s and $\mathtt{c}$'s have to come in order, and after you simplify, you arrive again at $S_1 \to \mathtt{a}S_1\mathtt{c} \mid S_2$, etc.

  • The third thing that helps me is the homomorphism and inverse-homomorphism closure properties of CFLs.

    In your example it doesn't help much, but these are quite powerful tools: you can transform some hard-to-grasp cases into well-known languages. Intuitively, you can think of them as in-grammar substitution of terminals. For example, if you take the language from the first bullet with its grammar \begin{align}S_1 &\to \mathtt{(_c}\ S_1\ \mathtt{)_c} \mid S_2\\S_2 &\to \mathtt{(_b}\ S_2\ \mathtt{)_b} \mid \bullet\end{align} and transform it with the homomorphism $\phi : \big\{\mathtt{(_c},\mathtt{)_c},\mathtt{(_b},\mathtt{)_b},\mathtt{\bullet}\big\} \to \big\{\mathtt{a},\mathtt{b},\mathtt{c}\big\}^*$ defined as \begin{align} \phi(\mathtt{(_c}) &= \mathtt{a} & \phi(\mathtt{)_c}) &= \mathtt{c} \\ \phi(\mathtt{(_b}) &= \mathtt{a} & \phi(\mathtt{)_b}) &= \mathtt{b} \\ \phi(\bullet) &= \varepsilon \end{align} then you arrive at precisely your solution (i.e. the grammar of your language).
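The one-counter picture in the second bullet can be turned into a tiny recognizer sketch (my own illustration, not from the answer): $\mathtt{a}$'s increment the counter, $\mathtt{b}$'s and then $\mathtt{c}$'s decrement it, and the three phases must come in order.

```python
def accepts(word: str) -> bool:
    """One-counter recognizer for {a^i b^j c^k : i = j + k}."""
    order = {"a": 0, "b": 1, "c": 2}   # phase 0: a's, 1: b's, 2: c's
    counter, phase = 0, 0
    for ch in word:
        if ch not in order or order[ch] < phase:
            return False               # foreign symbol, or e.g. a b after a c
        phase = order[ch]
        if ch == "a":
            counter += 1
        else:
            counter -= 1
            if counter < 0:            # more b's/c's than a's seen so far
                return False
    return counter == 0                # every a must be balanced

# quick checks
assert accepts("aabc") and accepts("aabb") and accepts("aacc") and accepts("")
assert not accepts("abc") and not accepts("acb")
```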
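The homomorphism in the third bullet can likewise be checked mechanically (again my own sketch; the ASCII symbols C, c, B, b, . stand in for $(_c, )_c, (_b, )_b, \bullet$):

```python
# phi maps each bracket symbol to a terminal of the target language
PHI = {"C": "a", "c": "c", "B": "a", "b": "b", ".": ""}

def phi(word: str) -> str:
    return "".join(PHI[ch] for ch in word)

def bracket_words(max_k: int, max_m: int):
    """Words of the bracket language  C^k B^m . b^m c^k."""
    for k in range(max_k + 1):
        for m in range(max_m + 1):
            yield "C" * k + "B" * m + "." + "b" * m + "c" * k

for w in bracket_words(3, 3):
    img = phi(w)
    i, j, k = img.count("a"), img.count("b"), img.count("c")
    # the image is a well-ordered a^i b^j c^k word with i = j + k
    assert img == "a" * i + "b" * j + "c" * k and i == j + k
```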

Finally, one more thing, it doesn't really help with creating grammars, but frequently helps with proving something is not a context-free language:

  • Intersection with a regular language: if $L_1$ is a CFL and $R_2$ is regular, then $L_3 = L_1 \cap R_2$ is a CFL. In other words, if $L_3$ is not a CFL (while $R_2$, e.g. one constructed by you, is regular), then $L_1$ could not have been context-free.
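A classic worked instance of this technique (my illustration, not from the answer): to show that $L_1=\{w\in\{a,b,c\}^* : |w|_a=|w|_b=|w|_c\}$ is not context-free, intersect it with the regular language $R_2 = a^*b^*c^*$:

```latex
L_1 \cap R_2 \;=\; \{a^n b^n c^n : n \ge 0\}
```

The right-hand side is the standard non-context-free example (by the pumping lemma), so since CFLs are closed under intersection with regular languages, $L_1$ cannot be context-free either.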

If you would like to see more examples of the last two techniques in use, see this answer.

I hope this helps $\ddot\smile$