Sampling uniformly from the language generated by a grammar

184 Views Asked by Bumbble Comm At 25 Mar 2026 - 10:09

Suppose I have the following formal grammar:

$$ S \rightarrow \varepsilon + a S + b S + c S d S $$

where $S$ is a nonterminal symbol and $a$, $b$, $c$, $d$ are terminal symbols. How can I sample uniformly from the set of $n$-length strings that are generated by this grammar, without having to generate all $n$-length strings first?

For example, if the grammar were just

$$ S \rightarrow \varepsilon + a S + bS $$

we could generate a random $n$-length string by choosing $\varepsilon$ if $n = 0$, and otherwise choosing between $a$ and $b$ with equal probability followed by a random $(n-1)$-length string.
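A minimal sketch of this simpler case, assuming Python's standard `random` module (the function name `sample_ab` is only illustrative). It unrolls the recursion into a loop: each position is an independent fair choice between $a$ and $b$.

```python
import random

def sample_ab(n, rng=random):
    """Uniform sample from the length-n strings of S -> eps | aS | bS."""
    # Every length-n string over {a, b} is in the language, so n independent
    # fair choices give each string probability 2**-n.
    return ''.join(rng.choice('ab') for _ in range(n))

print(sample_ab(5))
```

This works only because both productions have the same number of completions. With the extra production $c S d S$ the choices have to be weighted by how many strings each one leads to.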

**Bumbble Comm** · Accepted Answer

I managed to solve the problem with an exact procedure. First, we find the generating function for the grammar:

\begin{align} y &= 1 + x y + x y + x y x y \\ y &= \frac{1 - 2x - \sqrt{1 - 4x}}{2x^2} \\ &= 1 + 2x + 5x^2 + 14x^3 + 42x^4 + 132x^5 + 429x^6 + 1430x^7 + \ldots \end{align}
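For completeness, here is a sketch of how the closed form follows from the first line. Collecting terms gives a quadratic in $y$, which the quadratic formula solves:

\begin{align} x^2 y^2 + (2x - 1)\, y + 1 &= 0 \\ y &= \frac{(1 - 2x) \pm \sqrt{(1 - 2x)^2 - 4x^2}}{2x^2} = \frac{1 - 2x \pm \sqrt{1 - 4x}}{2x^2} \end{align}

The minus sign is the right one: since $\sqrt{1 - 4x} = 1 - 2x - 2x^2 - 4x^3 - \ldots$, only that choice makes the numerator divisible by $x^2$ and gives a power series with $y(0) = 1$.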

The coefficient $y_n$ of $x^n$ counts the strings of length $n$ in the language generated by the grammar (the grammar is unambiguous, so each string has exactly one derivation). In this case, the coefficients are exactly the Catalan numbers shifted by one, $y_n = C_{n+1}$. They satisfy the recurrence relation

$$ y_n = \begin{cases} 1 & n = 0 \\ \frac{2(2n+1)}{n+2} y_{n-1} & \text{otherwise} \end{cases} $$
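As a sanity check, the recurrence can be compared against a brute-force count and against the Catalan numbers. This is only a sketch: the membership test `in_language` is not from the answer. It assumes the language consists of the strings in which $c$ and $d$ nest like brackets while $a$ and $b$ are unrestricted, which is what the production $c S d S$ generates.

```python
from itertools import product
from math import comb

def in_language(word):
    """Assumed membership test: c/d nest like brackets; a, b are unrestricted."""
    depth = 0
    for ch in word:
        if ch == 'c':
            depth += 1
        elif ch == 'd':
            depth -= 1
            if depth < 0:  # a 'd' with no open 'c' before it
                return False
    return depth == 0

def count_by_enumeration(n):
    # Checks all 4**n candidate strings, so only usable for small n.
    return sum(in_language(w) for w in product('abcd', repeat=n))

def count_by_recurrence(n_max):
    y = [1]
    for n in range(1, n_max + 1):
        # Integer division is exact here because y_n is an integer.
        y.append(2 * (2 * n + 1) * y[n - 1] // (n + 2))
    return y

def catalan(m):
    return comb(2 * m, m) // (m + 1)

y = count_by_recurrence(7)
print(y)  # [1, 2, 5, 14, 42, 132, 429, 1430]
print(all(y[n] == count_by_enumeration(n) == catalan(n + 1) for n in range(8)))  # True
```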

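We can precompute $y_n$ for all $n$ up to the maximum length we will sample. Next, we use the technique for random generation of combinatorial structures described in section 3 ("basic generation schemes") of *A calculus for the random generation of labelled combinatorial structures* by Flajolet et al. Applied to this grammar, a string of length $n \ge 1$ starts with $a$ or with $b$ with probability $y_{n-1}/y_n$ each, followed by a random string of length $n-1$; otherwise it has the form $c\,u\,d\,v$, where the length $k$ of $u$ is chosen with probability proportional to $y_k \, y_{n-2-k}$ and $u$, $v$ are sampled recursively.

Adapted to Python code: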

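from collections import Counter

import numpy as np

# coef[n] = y_n, the number of length-n strings. Integer arithmetic keeps the
# values exact; the division by (n + 2) never leaves a remainder.
coef = [1]
for n in range(1, 20):
  coef.append(2 * (2 * n + 1) * coef[n - 1] // (n + 2))

def sample_grammar(n):
  """Return a uniformly random string of length n (requires n < len(coef))."""
  if n == 0:
    return ''
  elif np.random.uniform() < 2 * coef[n - 1] / coef[n]:
    # Production a S or b S: each has coef[n - 1] completions.
    return str(np.random.choice(['a', 'b'])) + sample_grammar(n - 1)
  else:
    # Production c S d S: choose the inner length k with probability
    # proportional to coef[k] * coef[n - 2 - k].
    total = coef[n] - 2 * coef[n - 1]
    u = np.random.uniform()
    k = 0
    s = coef[0] * coef[n - 2] / total
    while s < u and k < n - 2:  # the bound on k guards against float rounding
      k += 1
      s += coef[k] * coef[n - 2 - k] / total
    return 'c' + sample_grammar(k) + 'd' + sample_grammar(n - 2 - k)

counter = Counter(sample_grammar(4) for _ in range(100000))
# Print the counts, normalized so they should all be approximately 1
total_count = sum(counter.values())
print('\n'.join(
  '{}\t{:.4f}'.format(string, count / total_count * len(counter))
  for string, count in sorted(counter.items())
))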

This outputs, for instance,

aaaa    0.9949
aaab    1.0006
aaba    0.9936
aabb    0.9926
aacd    0.9942
abaa    1.0091
abab    0.9979
abba    1.0022
abbb    1.0055
abcd    1.0007
acad    1.0027
acbd    1.0120
acda    0.9951
acdb    0.9947
baaa    0.9924
baab    0.9992
baba    0.9991
babb    1.0020
bacd    0.9967
bbaa    0.9944
bbab    1.0046
bbba    0.9897
bbbb    0.9996
bbcd    1.0148
bcad    1.0068
bcbd    0.9946
bcda    1.0080
bcdb    1.0028
caad    0.9959
cabd    0.9960
cada    0.9938
cadb    0.9967
cbad    1.0082
cbbd    1.0093
cbda    1.0035
cbdb    0.9946
ccdd    1.0085
cdaa    1.0038
cdab    0.9946
cdba    1.0067
cdbb    0.9980
cdcd    0.9899

All $y_4 = 42$ strings of length 4 appear, and the normalized counts are all close to 1, as expected from uniform sampling.
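The same scheme can also be written with integer arithmetic only, which avoids comparing floating-point cumulative sums. The following is a sketch of that variant, not the code from the answer: it swaps NumPy for the standard `random` module, and the names `y` and `sample_exact` are illustrative. It draws a uniform integer below $y_n$ and finds which production, and which split, that index falls into.

```python
import random
from math import comb

def y(n):
    # Number of length-n strings: the Catalan number C_{n+1}.
    return comb(2 * n + 2, n + 1) // (n + 2)

def sample_exact(n, rng=random):
    """Uniform length-n string of S -> eps | aS | bS | cSdS, integers only."""
    if n == 0:
        return ''
    r = rng.randrange(y(n))  # uniform integer in [0, y_n)
    if r < 2 * y(n - 1):
        # The first y_{n-1} indices start with 'a', the next y_{n-1} with 'b'.
        return ('a' if r < y(n - 1) else 'b') + sample_exact(n - 1, rng)
    r -= 2 * y(n - 1)
    for k in range(n - 1):  # inner length k = 0 .. n-2
        block = y(k) * y(n - 2 - k)
        if r < block:
            return 'c' + sample_exact(k, rng) + 'd' + sample_exact(n - 2 - k, rng)
        r -= block
    raise AssertionError('unreachable: the block sizes sum to y(n)')

print(sample_exact(10))
```

Because `y` is computed from a closed form, no table has to be sized in advance, at the cost of recomputing binomials on every call.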
