Can a polynomial-size CFG over a large alphabet describe the language where each terminal appears an even number of times?

  1. Can a CFG over a large alphabet describe a language where each terminal appears an even number of times?

  2. If yes, would the Chomsky Normal Form be polynomial in |Σ|?

EDIT: What about a language where each terminal appears 0 or 2 times?

EDIT: Solving this may give an interesting extension to "Restricted read twice BDDs and context free grammars"

Find a context-free grammar (CFG) for the language L of all words such that each terminal in a word occurs an even number of times, over a possibly large alphabet Σ.

My long approach is (the only nonterminal is S):

S ⟶ ε | SS

x ∈ Σ : S ⟶ xSx

x,y ∈ Σ : S ⟶ xxSyy | yySxx | xySxy | xySyx | yxSyx | yxSxy

Second try, incremental approach.

S ⟶ ε | SS
Terminal productions (suboptimal): x,y ∈ Σ : S ⟶ xx | yy | xxyy | yyxx | xyxy | xyyx | yxyx | yxxy
Incremental: x ∈ Σ : S ⟶ SxSxS
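As a quick soundness check (not a completeness proof), the sketch below randomly expands this second grammar over a two-symbol alphabet and verifies that every derived word has an even count of each terminal. The function names and the depth cutoff are my own illustrative choices; the cutoff just forces S ⟶ ε, which is itself a production, so every emitted word really is derivable.

```python
import random

SIGMA = ['a', 'b']  # assumed small alphabet for the check

def productions():
    """Right-hand sides of S in the 'second try' grammar."""
    rhs = [[], ['S', 'S']]                          # S -> eps | SS
    for x in SIGMA:
        rhs.append(['S', x, 'S', x, 'S'])           # incremental: S -> SxSxS
    for x in SIGMA:
        for y in SIGMA:
            if x != y:                              # terminal productions
                rhs += [[x, x], [x, x, y, y], [x, y, x, y], [x, y, y, x]]
    return rhs

def random_word(depth=6):
    """Randomly expand S; at the depth limit, fall back to S -> eps."""
    def expand(sym, d):
        if sym != 'S':
            return [sym]
        if d == 0:
            return []                               # S -> eps
        return [t for s in random.choice(productions())
                for t in expand(s, d - 1)]
    return ''.join(expand('S', depth))

for _ in range(1000):
    w = random_word()
    assert all(w.count(x) % 2 == 0 for x in SIGMA), w
print("ok: 1000 random derivations, all even")
```

Every production adds each terminal in pairs, so the parity invariant holds by induction; whether the grammar generates all of L is exactly the open part of the question.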

By the way, the homework is over now.

Technically it wasn't homework assigned to me. I don't know the answer and thought the question was easy. Best.

EDIT: I didn't award the bounty :-)

There are 5 answers below.

Answer (score 10):

There is a deterministic finite state machine that accepts a word if and only if each terminal appears an even number of times. Assume that there are $k$ symbols in the alphabet. The machine has $2^k$ states, each of which is associated with a subset of the alphabet, and we identify the states with the subsets. The machine is defined so that:

  • The initial state corresponds to the empty set, and this is the only accepting state.
  • The transition rules are defined as follows. From state $A$, on input $l$, if $l \in A$ then let $B = A - \{l\}$ and move to state $B$. Otherwise let $B = A \cup \{l\}$ and move to state $B$.

The idea is that a state $B$ corresponds to the set of terminals that have appeared an odd number of times so far. Being in state $\emptyset$ corresponds to every terminal appearing an even number of times.

Since the language is accepted by a DFSM, the language is regular. This means that there is, trivially, a CFG that generates the language: the grammar has one nonterminal for each state of the machine, and the rules of the grammar match up with the transitions of the machine. I don't know yet whether there is a context-free grammar for this language whose size is polynomial in the size of the alphabet, though.
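The machine is easy to simulate: the state is just the set of symbols seen an odd number of times, and a symmetric difference implements both transition cases at once. A minimal sketch:

```python
def accepts(word):
    """DFSM from the answer: state = set of symbols with odd count so far."""
    odd = set()                  # initial state: the empty set
    for l in word:
        odd ^= {l}               # B = A - {l} if l in A, else B = A | {l}
    return not odd               # accept iff we end in the empty set

assert accepts("") and accepts("abba") and accepts("abab")
assert not accepts("aab")        # 'a' appears twice but 'b' appears once
```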

Answer (score 7):

Edit: This solution is wrong, as explained in the comments below. The proof breaks in the italicized phrase.

Let $\Sigma = \{ \sigma_i \}$. Take the following grammar of size $O(|\Sigma|)$: $$S \longrightarrow \sigma_i S \sigma_i S | \varepsilon$$ Clearly in every word produced, every terminal appears an even number of times (proof by induction). In order to generate a word $w$ matching this description, suppose that $w = \sigma_i w'$. Then $\sigma_i$ must appear again in $w'$, say $w = \sigma_i z_1 \sigma_i z_2$. *Note that $z_1,z_2$ also match the description.*

Now in CNF: $$\begin{align*} S &\longrightarrow \varepsilon \\ S &\longrightarrow S_i S_i \\ S_i &\longrightarrow \sigma_i | A_i T \\ A_i &\longrightarrow \sigma_i \\ T &\longrightarrow S_i S_i \end{align*}$$
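Since the solution above is marked wrong, a concrete failure may help. The sketch below (my own helper, not from the answer) decides membership in the language generated by $S \to \sigma_i S \sigma_i S \mid \varepsilon$ straight from the production, and shows that abab, which has every symbol an even number of times, is not generated.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def derives(w):
    """Can S ->* w under S -> sigma S sigma S | eps?"""
    if w == "":
        return True
    # w must factor as sigma u sigma v with u, v both derivable,
    # where sigma = w[0] and the second sigma sits at some position i
    return any(w[i] == w[0] and derives(w[1:i]) and derives(w[i + 1:])
               for i in range(1, len(w)))

assert derives("aa") and derives("abba") and derives("aabb")
assert not derives("abab")   # in L, but not generated: the proof's gap
```

For $w = abab$ the only candidate split is $z_1 = b$, $z_2 = b$, and neither matches the description.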

Answer (score 4):

Edit: Here is a more verbose proof. The original proof is retained below.

Introduction

Let $\Sigma$ be some alphabet with $n$ symbols. Let $L$ be the language of words in which each symbol appears an even number of times. Suppose $G$ is a grammar for $L$. Our goal is to give an exponential lower bound on the size of $G$ in terms of $n$.

Transition to Chomsky Normal Form

According to Wikipedia (and not difficult to prove), there is a grammar $G'$ for $L$ in Chomsky Normal Form such that $|G'| = O(|G|^2)$. We will show an exponential lower bound on $|G'|$, and this will imply an exponential lower bound on $|G|$ (if $|G'| = \Omega(c^n)$ for $c > 1$ then $|G| = \Omega(\sqrt{c}^n)$).

For our purposes, a grammar is in Chomsky Normal Form (CNF) if the productions are of the following two types: $A \rightarrow BC$ and $A \rightarrow a$, where $A,B,C$ are non-terminals and $a$ is terminal. Some people also place the restriction that $B,C$ cannot equal the starting symbol $S$, but we do not need this condition.

Word derivation in CNF grammars

Let $w \in L$. Choose some derivation of $w$ in $G'$ (there might be several sharing the same derivation tree). The derivation can be thought of as a binary tree (derivation tree) where each non-leaf node is labeled by a non-terminal, and each leaf is labeled by a terminal; a non-leaf node either has two non-leaf children or one leaf child.

As an example, consider the derivation $$S \rightarrow AB \rightarrow ACD \rightarrow^* xyz.$$ The corresponding derivation tree can be described as follows (sorry for the missing graphics):

  1. The root is labeled $S$.
  2. The root has two children labeled $A,B$.
  3. The node labeled $B$ has two children labeled $C,D$.
  4. The node labeled $A$ has one child labeled $x$, which is a leaf.
  5. The node labeled $C$ has one child labeled $y$, which is a leaf.
  6. The node labeled $D$ has one child labeled $z$, which is a leaf.

Note that in general, several nodes can share a label (indeed, this is how one proves the pumping lemma).

Each node in the tree generates a subword of the original word, which is formed from its descendant leaf nodes in order. Continuing our example:

  1. The root generates the entire word $xyz$.
  2. The node labeled $A$ generates the subword $x$.
  3. The node labeled $B$ generates the subword $yz$.
  4. The node labeled $C$ generates the subword $y$.
  5. The node labeled $D$ generates the subword $z$.

A subword lemma

Lemma. Suppose $w \in L$ is of length $m$, and choose some positive $\ell \leq m$. Then there is a node in the derivation tree of $w$ that generates a subword of length $\alpha$ satisfying $\ell \leq \alpha < 2\ell$.

Proof. Run the following algorithm:

  1. Set $v$ to the root of the derivation tree.
  2. While the subword generated by $v$ has size at least $2\ell$, replace $v$ by its child whose generated subword is longest.
  3. Output the subword generated by $v$.

Note that the node $v$ always has two children, since otherwise it would generate a subword of length $1 < 2\ell$. If the loop in step (2) never runs then certainly $v$ generates a subword of the requisite length, by our assumption that $\ell \leq m$.

Otherwise, consider the last iteration of the loop. Denote by $w(u)$ the subword generated by a node $u$. In the last iteration of the loop, there is a node $v$ with $|w(v)| \geq 2\ell$, with children $v_1,v_2$ such that $|w(v_1)| \geq |w(v_2)|$; we move to $v_1$. Note that $$2\ell \leq |w(v)| = |w(v_1)| + |w(v_2)| \leq 2|w(v_1)|$$ so that $|w(v_1)| \geq \ell$. Since the loop stopped, $|w(v_1)| < 2\ell$ and $w(v_1)$ is of the requisite size.
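The algorithm can be sketched on an explicit derivation tree; the tuple encoding below is an illustrative choice, not part of the proof.

```python
def word_of(node):
    """Subword generated by a node: concatenation of its descendant leaves."""
    label, children = node
    return label if not children else ''.join(word_of(c) for c in children)

def find_subword(root, ell):
    """Steps 1-3 of the lemma: descend into the larger child while the
    generated subword still has length at least 2*ell."""
    v = root
    while len(word_of(v)) >= 2 * ell:
        v = max(v[1], key=lambda c: len(word_of(c)))
    return word_of(v)

# Derivation tree of the earlier example S -> AB -> ACD ->* xyz
leaf = lambda t: (t, [])
tree = ('S', [('A', [leaf('x')]),
              ('B', [('C', [leaf('y')]), ('D', [leaf('z')])])])

x = find_subword(tree, 1)
assert 1 <= len(x) < 2       # the lemma's guarantee with ell = 1
```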

The proof

We consider the set $P$ of words in $L$ of the form $\pi(\Sigma) \pi(\Sigma)$, where $\pi$ runs over all permutations of $n$ symbols. For example, if $\Sigma = \{0,1\}$ then $$P = \{0101,1010\}.$$ Note that indeed $P \subseteq L$.

Since all words in $P$ are of length $2n$ (recall $n$ is the size of the alphabet $\Sigma$), using the lemma with $\ell = n/3$ (assume for simplicity that $n$ is divisible by $3$) we get that some node in the derivation tree of any $w \in P$ generates a subword $x(w)$ whose size satisfies $$n/3 \leq |x(w)| < 2n/3.$$ Denote by $s(w)$ the non-terminal generating $x(w)$ (it is the label of the node).

Since $|x(w)| \leq n$, all symbols in $x(w)$ are distinct: indeed, any subword of $w$ of length $n$ is a cyclic rotation of the original permutation (think of all subwords of length $3$ in $abcabc$: $abc,bca,cab,abc$).
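The distinctness claim is easy to verify exhaustively for a small alphabet; the sketch below checks every subword of length at most $n$ in every doubled permutation for $n = 4$:

```python
from itertools import permutations

n, sigma = 4, "abcd"
for pi in permutations(sigma):
    w = ''.join(pi) * 2                      # a word of P, of length 2n
    for start in range(2 * n):
        for length in range(1, n + 1):
            sub = w[start:start + length]    # any window of length <= n
            assert len(set(sub)) == len(sub), (w, sub)
print("ok: all short subwords of doubled permutations are distinct")
```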

Our strategy now will be to show that any single non-terminal can generate only a relatively small number of the $x(w)$; since $P$ is big, it will follow that $G'$ must have lots of non-terminals. More explicitly, let $$S = \{ s(w) : w \in P\}.$$ We can find a list of representative, distinct words $w_1,\ldots,w_{|S|}$ such that $$S = \{ s(w_i) : 1 \leq i \leq |S| \}.$$ We will show that for each $w_i$, the set of words $w \in P$ with $s(w) = s(w_i)$ consists of at most $B$ words ($B$ will be determined below). It will follow that the number of distinct non-terminals is at least $$\frac{|P|}{B} = \frac{n!}{B}.$$

So let $w \in P$, and suppose that $s(w) = s(w_i)$ for some fixed $i$. What this means is that we can replace $x(w)$ in $w$ with $x(w_i)$ and still get a word in $L$. The word $w$ has a decomposition $w = lx(w)r$. Every symbol appearing in $x(w)$ appears exactly once in $lr$, and every other symbol appears exactly twice in $lr$. Note that this defines the word $x(w)$ up to permutation, i.e. since we assumed that $lx(w_i)r \in L$, then $x(w_i)$ must be a permutation of $x(w)$ (more accurately, the set of (unique) symbols occurring in $x(w),x(w_i)$ must be the same).

We will now bound the number of words in $P$ that have a subword of length $|x(w_i)|$ which is a permutation of $x(w_i)$. Any such word $w$ can be broken as $lyzr$ where $yz$ is a permutation of $x(w_i)$ and $|ly|=|zr|=n$; in fact, since $w \in P$ then $ly = zr$. There are $|x(w_i)|!$ possible choices for $yz$. Together, these define $|x(w_i)|$ points of the permutation $\pi(\Sigma) = ly$. For the rest of the points we have $(n-|x(w_i)|)!$ possible choices. Finally, $yz$ can start in at most $n$ "cyclic" locations inside $\pi$, and so $$B \leq n|x(w_i)|!(n-|x(w_i)|)!.$$ When is this expression maximized, as a function of $\alpha = |x(w_i)|$? Taking the logarithm of the bound, $$\log B \leq \log n + \log \alpha! + \log (n-\alpha)!.$$ The factorial function (in fact, the Gamma function) is log-convex, and so subject to the condition $n/3 \leq \alpha \leq 2n/3$, this expression is maximized for $\alpha=n/3$ or $\alpha =2n/3$ (the value is the same). That is $$B \leq n(n/3)!(2n/3)!.$$

Plugging the value of $B$ that we have laboriously estimated, we conclude that the number of different non-terminals in $G'$ must be at least $$\frac{|P|}{B} = \frac{n!}{n(n/3)!(2n/3)!} = \frac{1}{n} \binom{n}{n/3}.$$ Now we will use Stirling's approximation $$m! \sim \sqrt{2\pi m} (m/e)^m.$$ Plugging the approximation, we get $$\frac{n!}{n(n/3)!(2n/3)!} \sim \frac{\sqrt{2\pi n}(n/e)^n}{n\sqrt{2\pi(n/3)}(n/3e)^{n/3} \sqrt{2\pi(2n/3)}(2n/3e)^{2n/3}}.$$ All powers of $n$ and $e$ cancel, and we're left with the following $$\frac{n!}{n(n/3)!(2n/3)!} \sim \frac{1}{n\sqrt{4\pi n/9}} 3^{n/3} (3/2)^{2n/3} = \Omega\left(n^{-3/2} c^n \right),$$ where the constant $c$ is $$c = 3^{1/3} (3/2)^{2/3} = \frac{3}{2^{2/3}} > 1.$$

In particular, the original grammar $G$ must be of size at least $\Omega(n^{-3/4} \sqrt{c}^n)$.
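As a numeric sanity check of this estimate (the prefactor $3/(2\sqrt{\pi})$ comes from collecting the Stirling square-root factors), one can compare the exact count with the asymptotic form:

```python
from math import factorial, sqrt, pi

c = 3 / 2 ** (2 / 3)                         # the constant from the proof
for n in (9, 30, 90):                        # multiples of 3, as assumed
    exact = factorial(n) / (n * factorial(n // 3) * factorial(2 * n // 3))
    approx = (3 / (2 * sqrt(pi))) * n ** -1.5 * c ** n
    print(n, exact / approx)                 # ratio tends to 1 as n grows
```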

Exactly 0 or 2 occurrences

Another question concerned the language $L'$ of all words which contain exactly $0$ or $2$ occurrences of each symbol in $\Sigma$. The proof given in the preceding section applies equally well to $L'$, since the set of words $P$ we used belongs to $L'$; the rest of the proof uses the fact that certain words are not in $L$, and since $L \supset L'$, they are a fortiori not in $L'$.

Generalization

We have seen that the lower bound holds for both $L$ and $L'$. When does the proof hold for a language $\Lambda$? We used only two properties of $\Lambda$:

  1. $\Lambda$ contains $P$.
  2. If $w_1,w_2 \in P$, $x_1,x_2$ are the subwords of $w_1,w_2$ chosen using the subword lemma, and $w_1(x_1:=x_2),w_2(x_2:=x_1) \in \Lambda$ (i.e. the word resulting by replacing $x_1$ with $x_2$ in $w_1$, and the word resulting by replacing $x_2$ with $x_1$ in $w_2$) then $x_2$ is a permutation of $x_1$.

When will the second condition fail? The words $x_1,x_2$ both have no repeated symbols. If they are not permutations of each other then $w_1(x_1:=x_2)$ will have a symbol repeated $1$ or $3$ times. So words containing such symbols must lie outside $\Lambda$. Considering $w_2(x_2:=x_1)$, we see that it is enough to exclude one of the possibilities. We get the following theorem.

Theorem. Suppose $\Lambda$ is a language containing $P$ in which no word contains any symbol exactly once, or in which no word contains any symbol exactly $3$ times. Then any grammar for $\Lambda$ must have size $\Omega(n^{-3/4} C^n)$, where $C = \sqrt{3}/\sqrt[3]{2}$.

As an example, the theorem applies for the language in which the number of occurrences of each symbol lies in $S$ as long as $2 \in S$ and $1,3 \notin S$. For example, $L$ corresponds to $S = \{0,2,4,6,\ldots\}$ and $L'$ corresponds to $S = \{0,2\}$.

We leave the reader the following easy generalizations.

Theorem. Suppose $\Lambda$ is a language containing $\{ \pi(\Sigma)^\ell \}$ for some $\ell > 0$, in which no word contains any symbol exactly $\ell + \epsilon$ times, where $\epsilon$ is either $1$ or $-1$. Then any grammar for $\Lambda$ must have size $\Omega(n^{-3/4} C^n)$, where $C = \sqrt{3}/\sqrt[3]{2}$.


Original, condensed proof

Let $L$ be the language of words in which each symbol appears an even number of times. We give an exponential lower bound (in $n = |\Sigma|$) on the size of a Chomsky Normal Form (CNF) grammar for $L$. This implies an exponential lower bound on the size of any context-free grammar for $L$, since (according to Wikipedia) the blowup to CNF is at most quadratic.

Given any derivation of a word $w$ of length $m$ and any $\ell < m/2$ we can always find a non-terminal symbol in the derivation which derives a subword $x$ of $w$ of length $\ell \leq |x| < 2\ell$: start from the root of the derivation, and always pick the larger branch. The first time we hit a branch smaller than $2\ell$, it must be at least $\ell$ in size.

Consider the set of all $S = n!$ words in $L$ of the form $\pi\pi$ (all of size $2n$), where $\pi$ is a permutation of $\Sigma$. For each such word $w$ we can find a symbol $s_w$ generating a subword $x_w$ of size $n/3 \leq |x_w| < 2n/3$. Note that all symbols in $x_w$ are distinct. Therefore if $s_u = s_v$ then $x_u,x_v$ must be permutations of each other.

Given $w$, for how many words $u$ can it be that $s_w = s_u$? To get from $w$ to $u$ we need to permute $x_w$, to permute the rest of the permutation, and to choose the starting location of $x_u$ within the permutation generating $u$. All in all, the number of possibilities is at most $$M = n|x_w|!(n-|x_w|)! \leq n(n/3)!(2n/3)!.$$ Therefore the number of distinct symbols in the grammar is at least $$\frac{S}{M} = \frac{n!}{n(n/3)!(2n/3)!} = \Omega\left(\frac{c^n}{n^{1.5}}\right),$$ where $c = 3/2^{2/3} \approx 1.88988$ (thanks to Wolfram Alpha).

Answer (score 3):

I have no idea how to comment. But I do know how to answer. Sorry about that.

I considered the pumping lemma. And I considered how many words are necessary to pump the language.

Consider the language over {0,1}.

We need the following words: 00, 11, 0101, and 1010. To pump we can combine any words by concatenation or insertion. For this example a grammar would be

S => ε | S0S0S | S1S1S | S0S1S0S1S | S1S0S1S0S

A quick count shows that the number of words necessary for the grammar with n terminals is exponential in n. I guess that shows there is no polynomial-sized grammar.

Answer (score 4):

jerr18 ---

Ignoring the fact that my posted "answer" was clearly indicated to be a "comment" ...

Perhaps you do not understand what a proof is. As a writer, a proof is what makes me believe a result; as a reader, whatever you wish to accept is a proof. A counterexample is what is needed to show a proof is defective. Sometimes a believer will use a comment to provide a counterexample. (I am not asking that you make me disbelieve my comment/proof by providing a counterexample.)

As for your question, I believe that the empty language has a polynomial-size grammar for any alphabet, and it satisfies your requirement that every terminal appears an even number of times. I suppose your question should be reposed to conform to the effort that Yuval Filmus put forth.