How many words are there in a finite context-free grammar in Chomsky normal form?

1.9k Views Asked by At

Given a CFG $G$ written in CNF with $|V|$ variables and $|T|$ terminals, what's the upper bound of the number of words in $L(G)$ if it is finite?

Specifically, the Chomsky normal form requires that all productions in $G$ of the form $$X\rightarrow YZ$$ or $$X\rightarrow a$$ where $X,Y,Z$ are variables, and $a$ is a terminal.

1

There are 1 best solutions below

6
On BEST ANSWER

Knowing just $|V|\geq 1$ and $|T|\geq 1$, there is no upper bound. For example, if $V=\{v\}$ and $T=\{a\}$, you may have rules $v\to a^m$ for as many different $m$ as your heart desires.

If you know $|V|\geq 2$ and $|T|\geq 2$ and also $r\geq 3$, the number of rules, then still you cannot put an upper bound on the size of the language, assuming it is finite. For example, let $V=\{u,v\}$ with $v$ the root variable, and $T=\{a,b\}$, and consider the CFG consisting of the rules $u\to ab^i$, $1\leq i\leq r-1$, and $v\to u^m$; the language consists of $(r-1)^m$ words.

Now, if you know $|V|$, $|T|$, the number $r$ of the rules, and $l$, the maximum number of variables in any rule of the CFG, and the language produced by the CFG is finite, then there exists an upper bound on the size of the language that is expressible by $|V|$, $|T|$, $r$, and $l$. Indeed, every derivation tree has no more that $|V|$ internal nodes, and each node has at most $l$ descendants. There is only
a finite set $\mathcal{S}$ of trees that satisfy these constraints (here we talk of plain trees, without the variables and rules attached to its internal nodes and its edges), and $|\mathcal{S}|$ has an upper bound expressible
by $|V|$ and $l$. We convert a tree $S\in\mathcal{S}$ to a derivation tree by attaching to each of its internal nodes
a rule, which can be done in at most $r\cdot l!$ ways at each node. (Some combinations of attaching the rules do not produce legitimate derivation trees. Some plain trees may even not be convertible to
a derivation tree. But this does not affect our argument; it suffices that the successful combinations of attachments produce all derivation trees.) Therefore, if $\mathcal{D}$ is the set of all derivation trees, then $|\mathcal{D}|\leq |\mathcal{S}|\cdot(r\cdot l!)^{|V|}$. Each derivation tree $D\in\mathcal{D}$ produces the word $w(D)$ of the language $L$ defined by the CFG; the mapping $w\colon\mathcal{D}\to L$ is surjective, so our upper bound for $|\mathcal{D}|$ is also an upper bound for $|L|$. We have obtained a very crude upper bound; you may lower it by a more careful thinking. But the main point here is that given the data $|V|$, $|T|$, $r$, and $l$ you can give
an upper bound on the size of the language.

[Later.] $~$What a bummer: I completely overlooked that the CFG is given in CNF (Chomsky normal form).$\,$ (This was not yet so explicitly emphasized as it is now, in the edited question. Still, it was sloppy of me to miss it.)$\,$ Let's write $n:=|V|$ and $m:=|T|$. Since we have CNF, we have an upper bound for $r$, the number of production rules, namely $r\leq n(n\!-\!1)^2 + nm + 1$ (the last $1$ is for the rule $s\to\varepsilon$, if present), and of course $l=2$. The crude reasoning above then does produce an upper bound on $|L|$ that is a function of $n$ and $m$. With some patience (nad more careful reasoning) one could produce an actual formula for the upper bound. Alas, this will not be me: currently I possess very little patience, if any$\ldots$

[Next day, early morning.] $~$During a sleepless night my impatient mathematical mind stumbled over an insight and fell nose forward into the following revelation:

Proposition. $~$Let $G$ be a context-free grammar in Chomsky normal form, with a set $V$
of variables and a set $T$ of terminal symbols. If $G$ produces a finite language $L$, then $|L|\leq F(|V|,|T|)$, where $$ F(n,m) \,=\, \sum_{k=0}^{2^{n-1}} m^k~, \qquad\quad n\geq 1\,,~m\geq 0\,. $$ The given upper bound is exact: for any integers $n\geq 1$, $m\geq 0$ there exists a context-free grammar in Chomsky normal form, with $n$ variables and $m$ terminal symbols, which produces
a language $L$ with $|L|=F(n,m)$.

We have $F(n,0)=1$ and $F(n,1)=2^{n-1}+1$, while for $m > 1$ we can of course write $$ F(n,m) \,=\, \frac{m^{2^{n-1}+1}-1}{m-1}~. $$

Proof. $~$Let $G$ be as in the proposition, and let $S$ be the root variable (i.e., the "start symbol").
Call a nonroot variable $A$ active if there exists a derived production $S\to uAv$, where $u,v\in V^*$ (and $uv\neq\varepsilon$), and call $A$ productive if there exists a derived production $A\to w$ with $w\in T^*$
(and $w\neq\varepsilon$). We shall say that the grammar $G$ is trim if all its nonroot variables are active and productive. If the grammar $G$ is not trim, we keep removing nonroot variables that are not active or not productive, also removing all rules that contain any removed variables, until all the remaining nonroot variables are both active and productive, and so obtain a trim grammar $G'$ which produces the same language as the grammar $G$.

For integers $n\geq 1$ and $m\geq 0$ let $F(n,m)$ be the largest cardinality of a finite language produced by a CFG in CNF with $|V|\leq n$ and $|T|\leq m$. It it clear that in our task of determining $F(n,m)$ we can restrict ourselves to trim grammars.

So let $G$ be a trim CFG in CNF. We construct a directed graph $P$: the set of vertices is $V$, and for every production rule $A\to B\mspace{1mu}C$ we draw two directed edges $A\to B$ and $A\to C$. If the graph $P$ contains a directed cycle, then $G$ produces an infinite language. Indeed, the cycle does not contain the root variable; let $A$ be a nonroot variable on the cycle. Going around the cycle, from $A$ to $A$, we obtain a derived production $A\to u_1\mspace{-1mu}A\mspace{1mu}u_2$, where $u_1,u_2\in V^*$ and $u_1u_2\neq\varepsilon$. Going around the cycle $r$ times, we get the derived production $A\to u_1^r\mspace{-1mu}A\mspace{1mu}u_2^r$. Since $A$ is active, there exists a derived production $S\to v_1\mspace{-1mu}A\mspace{1mu}v_2$. With all nonroot variables productive it is clear that $G$ produces arbitrarily long words in $T^*$.

Now suppose that $G$ is a CFG in CNF that produces a finite language $L$. The associated directed graph $P$ is acyclic, therefore we can enumerate the variables as $A_1$, $A_2$, $\ldots$, $A_n$, where $S=A_n$, so that every production rule $A_k\to A_iA_j$ has $k>i,\,j$. We construct a CFG $H$ in CNF with the set $V$ of variables, with the set $T$ of terminal symbols, and with the following production rules: rules $A_k\to A_iA_j$ for all integers $i$, $j$, $k$ such that $1\leq i,\,j<k\leq n$; rules $A_k\to a$ for $1\leq k\leq n$ and all $a\in T$; and the rule $S\to\varepsilon$. The grammar $H$ produces a finite language $K$, which contains the language $L$. If $n=1$, then the root variable $S$ is the only variable in $H$ and the production rules
of $H$ are $S\to a$ for all $a\in T$ and the rule $S\to\varepsilon$, thus $H$ produces the language $\{\varepsilon\}\cup T^1$. Suppose that $n>1$. For $1\leq k\leq n$ let $K_k$ be the language produced by the grammar $H$ from
the variable $A_k$ (considered as a root variable). The following is easy to prove: \begin{align*} K_1 &\,=\, T^1~,\\ K_k &\,=\, T^1 \cup K_{k-1}^2~, \qquad \text{for $1<k<n$}~, \\ K_n &\,=\, \{\varepsilon\} \cup T^1 \cup K_{n-1}^2~. \end{align*} It follows, by induction on $k$, that for $1\leq k< n$ we have $K_k=\{w\in T^*\mid 1\leq|w|\leq 2^{k-1}\}$, whence $$ K\,=\,K_n\,=\,\bigl\{w\in T^* \,\big|\, |w|\leq 2^{n-1}\bigr\}~, $$ which holds also for $n=1$.$~$ Done.