Suppose I have the following formal grammar:
$$ S \rightarrow \varepsilon + a S + b S + c S d S $$
where $S$ is a nonterminal symbol and $a$, $b$, $c$, $d$ are terminal symbols. How can I sample uniformly from the set of $n$-length strings that are generated by this grammar, without having to generate all $n$-length strings first?
For example, if the grammar were just
$$ S \rightarrow \varepsilon + a S + bS $$
we could generate a random $n$-length string by choosing $\varepsilon$ if $n = 0$, and otherwise choosing between $a$ and $b$ with equal probability followed by a random $(n-1)$-length string.
I managed to solve the problem with an exact procedure. First, we find the generating function for the grammar:
\begin{align} y &= 1 + x y + x y + x y x y \\ y &= \frac{1 - 2x - \sqrt{1 - 4x}}{2x^2} \\ &= 1 + 2x + 5x^2 + 14x^3 + 42x^4 + 132x^5 + 429x^6 + 1430x^7 + \ldots \end{align}
The coefficient $y_n$ of $x^n$ tells us how many strings of length $n$ there are in the grammar. In this case, the coefficients are exactly the Catalan numbers, shifted by one. They satisfy the recurrence relation
$$ y_n = \begin{cases} 1 & n = 0 \\ \frac{2(2n+1)}{n+2} y_{n-1} & \text{otherwise} \end{cases} $$
We can pre-compute $y_n$ for all $n$ up to the maximum length we will sample. Next, we use the technique for random generation of combinatorial structures described in section 3 ("basic generation schemes") of A calculus for the random generation of labelled combinatorial structures by Flajolet et al:
Adapted to Python code:
This outputs, for instance,
Notice the (normalized) counts are all close to 1, as would be expected from a uniform sampling.