Are context free languages closed under taking substring?

1.8k Views Asked by At

Let $L$ be context free and $\Sigma $ an alphabet

Define $s(L):=\{y \in \Sigma ^*\mid \exists x,z \in \Sigma ^*: xyz \in L\}$

Is $s(L)$ context free ?

I haven't been able to find a counterexample so im thinking i have to prove it.

I was trying to use the closure properties. $s(L)$ obviously contains $L$ so maybe i can write $s(L)$ as the union of several context free languages.

Could i please get some help, im stuck.

5

There are 5 best solutions below

0
On

Hint: given a CFG grammar $G$ for $L$, construct a grammar $G'$ that generates all possible substrings.

This involves adding a lot of rules that start in the middle of an original rule, followed by adding a lot of possible suffixes for each rule. It is a bit tricky to get correct, but definitely possible.

0
On

Maybe a PDA is easier or more intuitive than the grammar. Suppose you have a PDA $A$ for $L$. You can then build a new one for s(L) in the following manner:

  1. From the start you run $A$, but not on the input. Rather it guesses nondeterministically some letter that it could have read and acts if as $A$ had read it.

  2. At some point you guess that the substring starts. Now the head actually starts moving and reads the input just as $A$ would do - however, in contrast to the original $A$ the stack is not empty at the beginning.

  3. When all the input is read, you start another "virtual" phase for $A$, reading guessed letters. If you arrive at an accepting configuration, you accept.

The new automaton has an accepting run for every substring of a word from $L$.

0
On

The shortest way to prove this result using closure properties is to use rational transductions, which are known to preserve context-free languages. This is a very powerful method, which also avoids handshaking proofs.

Let $\tau: \Sigma^* \to \Sigma^*$ be the transduction defined, for every $u \in \Sigma^*$, by $\tau(u) = \Sigma^*u\Sigma^*$. This transduction is rational since its graph $$ \{(u,v) \mid v \in \tau(u) \} = (\Sigma, 1)^* \Bigl(\bigcup_{a \in \Sigma}\ (a,a)\Bigr)^*(1,\Sigma)^* $$ is a rational subset of $\Sigma^* \times \Sigma^*$. Now, it is known that the inverse of a rational transduction is also a rational transduction. The inverse transduction $\tau^{-1}$ is defined by $$ v \in \tau^{-1}(u) \iff u \in \tau(v). $$ Finally, since rational transductions preserve context-free languages, it suffices to observe that $s(L)= \tau^{-1}(L)$.

0
On

Yes, context-free languages are closed under the substring operator.

  1. One way to prove this is to use string homomorphisms.

    • Definition. If $\Sigma$ and $\Gamma$ are two alphabets, then a string homomorphism $h:\Sigma^*\rightarrow \Gamma^*$ is a function that sends each symbol in $\Sigma$ to a word in $\Gamma^*$. By extension, it transforms words in $\Sigma^*$ by transforming each letter separately: $h(a_1\ldots a_n) = h(a_1)h(a_2)\ldots h(a_n)$. (As a special case, $h(\epsilon) = \epsilon$.)

      A string homomorphism applies to a language in the obvious way: $h(L) \equiv \{h(w) : w \in L\}$

      A string homomorphism can also be applied in reverse: $h^{-1}(L) \equiv \{w \in \Sigma^* : h(w) \in L\}$.

    • Closure properties. It turns out that context-free languages are closed under homomorphism and inverse homomorphism:

      if $L$ is a context-free language over the alphabet $\Sigma$, and $h:\Sigma^*\rightarrow \Gamma^*$ is any homomorphism, then $h(L)$ is a context-free language, and $h^{-1}(L)$ is a context-free language.

  2. This closure property can help prove that context-free languages are closed under the substring operator $s$.

    • Suppose $L$ is a context-free language over the alphabet $\Sigma$.

    • Define new alphabets $\color{blue}\Gamma_1$ and $\color{purple}\Gamma_2$ which have the same symbols as $\Sigma$ except they're each painted a distinguishing color (say, $\color{blue}\Gamma_1$ symbols are all blue, and $\color{purple}\Gamma_2$ symbols are all purple).

      (If you want a formal way to define color-coding, you can use $\Gamma_1 \equiv \Sigma \times \{0\}$ and $\Gamma_2 \equiv \Sigma \times \{1\}$.)

    • We'll make use of two homomorphisms:

      There's a color-forgetting homomorphism $\mathsf{forget}:( \color{blue}\Gamma_1 \cup \color{purple}\Gamma_2)^* \rightarrow \Sigma^*$ which simply replaces each colored symbol with its uncolored counterpart in $\Sigma$.

      There's a purple-erasing homomorphism $\mathsf{nopurple}:( \color{blue}\Gamma_1 \cup \color{purple}\Gamma_2)^* \rightarrow ( \color{blue}\Gamma_1 \cup \color{purple}\Gamma_2)^* $ which sends the blue symbols $\color{blue}\Gamma_1$ to themselves, but erases all the purple symbols $\color{purple}\Gamma_2$ by sending each one to the empty string $\epsilon$.

  3. Here is the proof:

    1. Let $L$ be a context-free language.

    2. Then $\mathsf{forget}^{-1}(L)$ is an inverse homomorphism of $L$, so it is context-free. It consists of strings from $L$ where the characters are all arbitrarily colored blue or purple.

    3. The language $R \equiv \color{purple}\Gamma_2^* \color{blue}\Gamma_1^* \color{purple}\Gamma_2^*$ is regular. It consists of all strings of (zero or more) purple characters, then blue characters, then purple characters.

      Because $\mathsf{forget}^{-1}(L)$ is context-free, the language $\mathsf{forget}^{-1}(L) \cap R$ is context-free; after all, the intersection of a context-free and regular language is context-free.

      This language is the set of all strings from $L$ with a purple-colored beginning and end, and a blue-colored middle.

    4. Because $\mathsf{forget}^{-1}(L) \cap R$ is context-free, so are: \begin{align*} &&\mathsf{nopurple}[\mathsf{forget}^{-1}(L) \cap R]\\ &&\mathsf{forget}\left[\mathsf{nopurple}[\mathsf{forget}^{-1}(L) \cap R]\right] \end{align*} because these are homomorphisms of context-free languages. The first one is the set of all substrings of $L$, colored blue. The second one is the set of all substrings of $L$, uncolored—in other words, it's $s(L)$ as you defined it.

      Hence $s(L)$ is context-free whenever $L$ is. $\square$


Proof of closure under homomorphism. If $L$ is context-free, then it is represented by some context-free grammar $G$. Given a homomorphism $h$ form a new grammar $G^\prime$ in which every terminal symbol $a$ is replaced with $h(a)$; this new context-free grammar recognizes $h(L)$, which must therefore be context-free.

Proof of closure under inverse homomorphism. If $L$ is context-free, it is recognized by some pushdown automaton $P$; call its states $Q$ and its transition rule $\delta$.

  1. Given a homomorphism $h$, we'll construct a new pushdown automaton $P^\prime$ that recognizes $h^{-1}(L)$.

  2. Ideally, when our machine reads a character $a$, we would like it to load the value of $h(a)$ into some kind of internal buffer memory. Then we would like it to consume characters from its internal buffer memory, simulating what the original pushdown automaton $P$ would do if it were reading those characters as input. Whenever the buffer becomes empty, the machine reads the next character of real input.

  3. Note that this buffer can only be in finitely many configurations: it can have $h(a)$ in its memory for any symbol $a\in\Sigma$. Plus, it can have any suffix of $h(a)$ because some of the initial characters have been consumed already. For example, if the value of $h(a)$ is $pqrst$, then the buffer might be in state $pqrst, qrst, rst, s, t,$ or $\epsilon$.

    Given a homomorphism $h$, the set of buffer configurations $\mathscr{B}$ is the set $\{h(a) : a \in \Sigma\}$, plus all possible suffixes. It is finite(!).

  4. Because the set of buffer states $\mathscr{B}$ is finite, we can implement a buffer by augmenting the state space.

    Let the state space $Q^\prime$ of the new machine be given by $Q\times \mathscr{B}$. All we need to do is define the new transition rule so that the new machine behaves as required.

  5. If the buffer is empty, the machine should read a new input symbol $a$ and load $h(a)$ into memory. This is represented by transition rules of the form: $$\delta^*(\langle q, \epsilon\rangle, a, X) \rightarrow (\langle q, h(a)\rangle, X) $$ In other words, when the machine is in state $q$ with $\epsilon$ in the buffer and $X$ on the stack, put $h(a)$ in the buffer and put $X$ back on the stack.

  6. If the buffer is not empty, the machine may consume a character from the front of the buffer and simulate what $P$ would do on that input.

    In other words, whenever the original machine has the transition $\delta(q, b, X)\rightarrow (q^\prime, Y)$, then the new machine has the transition

    $$\delta^*(\langle q, b\cdots, \epsilon, X)\rightarrow (\langle q^\prime, \cdots), Y)$$

    for every buffer state $b\cdots$ beginning with $b$. In other words, if the buffer isn't empty, the machine can consume a character from the start of the buffer and simulate what $P$ would do with that character as input. This is an $\epsilon$-transition for $P^\prime$ because it is consuming its buffer, not its real input.

  7. The new machine starts in the same start state as the old machine, with an empty buffer.

    The new machine accepts whenever its buffer is empty and the old machine would accept.

    This completes the description of the pushdown automaton for accepting $h^{-1}(L)$. It follows that $h^{-1}(L)$ is context-free.

0
On

Yes, context-free languages are closed under the substring operator.

Let $L$ be a context-free language. Then there is a pushdown automaton $P$ which recognizes $L$. Denote $P$'s set of states by $Q$ and its transition rule by $\delta$.

We'll define a new pushdown automata $P^\prime$ that will recognize any substring of the language $L$.

  • The basic idea is that $P^\prime$ will contain three copies of the machine $P$: beginning, middle, and end. The first and third copies don't actually read characters from the input. Instead, they nondeterministically pick a character and simulate what $P$ would do if it had just read that character as input.

  • The second copy of $P$ actually works, consuming characters from the input.

  • When initialized, the new machine $P^\prime$ starts with the first machine, which can hand over control to the second machine at any time (by an $\epsilon$-transition), which can hand over the control to the third machine at any time. $P^\prime$ accepts whenever its third machine accepts.

  • In this way, the machine $P^\prime$ accepts all substrings of $L$. The first machine guesses how the string begins (putting symbols onto the stack and popping them off accordingly), the second machine reads the input itself, and the final machine guesses how it ends.

That's the construction; the rest is just details.

  • The states of the old machine are $Q$. Let the states of the new machine be $Q\times \{0,1,2\}$. Hence there are three copies of each state, corresponding to the three copies of the machine. 0 represents the beginning copy, 1 represents the middle copy, and 2 represents the end copy.

  • The initial state in the old machine is $q_{init}$. In the new machine, the initial state is $\langle q_{init}, 0\rangle$.

  • The old machine read characters from the input. The new machine will instead simulate reading characters from the input.

    In particular, whenever the old machine had a transition $$\delta(q, a, X)\rightarrow \langle q^\prime, Y\rangle, $$ the new machine will have an $\epsilon$-transition for the beginning copy (index 0) $$\delta^*(\langle q,0\rangle, \underline\epsilon, X)\rightarrow \rightarrow \langle q^\prime, Y\rangle$$ and for the ending copy (index 2) $$ \delta^*(\langle q,2\rangle, \underline\epsilon, X)\rightarrow \langle q^\prime, Y\rangle.$$ In other words, these machines can nondeterministically decide to simulate what $P$ would do if it read character $a$ from the input.

    The middle copy behaves as normal and actually consumes the input: $$ \delta^*(\langle q,1\rangle, \underline{a}, X)\rightarrow \langle q^\prime, Y\rangle.$$

  • The copies handle $\epsilon$-transitions just like the original machine: when the original machine has $\delta(q, \epsilon, X)\rightarrow (q^\prime, Y)$, the new copies have the equivalent transitions:

    $$\delta^*(\langle q, i\rangle, \epsilon, X)\rightarrow (\langle q^\prime, i\rangle, Y)\qquad i=0,1,2$$

  • The machines may hand off control to one another. The beginning machine hands off control to the middle machine, and the middle hands off control to the end machine:

    $$\delta^*(\langle q, 0\rangle, \epsilon, X)\rightarrow (\langle q, 1\rangle, X)$$ $$\delta^*(\langle q, 1\rangle, \epsilon, X)\rightarrow (\langle q, 2\rangle, X)$$

  • The new machine accepts whenever its ending machine accepts: if $F\subseteq Q$ are the accepting states of the original machine, then $F\times \{2\}$ are the accepting states of the new machine.

    This completes the description.