Finding where the tail starts for a probability distribution, from its generating function


Suppose we generate "random strings" over an $m$-letter alphabet, and look for the first occurrence of $k$ consecutive identical digits. I was, with some effort, able to find that the random variable $X$, denoting the time until we see $k$ consecutive identical digits, has probability generating function

$$P(z) = \sum_{n \ge 0} \Pr(X=n)z^n = \frac{(m-z)z^k}{m^k(1-z) + (m-1)z^k}$$

This correctly gives the expected time until we see $k$ consecutive identical digits, as

$$\operatorname{E}[X] = P'(1) = \frac{m^k - 1}{m - 1}$$

(For example, in a random stream of decimal digits, we expect to see $10$ consecutive identical digits after an expected number of $(10^{10}-1)/9 = 1111111111$ steps.)
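This expectation is easy to sanity-check by simulation. Below is a quick Monte Carlo sketch (my own check, not from the original post) for $m=2$, $k=3$, where the formula predicts $\operatorname{E}[X]=(2^3-1)/(2-1)=7$:

```python
import random

def waiting_time(m, k, rng):
    """Draw letters uniformly from an m-letter alphabet until k
    consecutive identical letters have appeared; return that time."""
    run, last, n = 0, None, 0
    while run < k:
        x = rng.randrange(m)
        n += 1
        run = run + 1 if x == last else 1
        last = x
    return n

rng = random.Random(12345)
m, k, trials = 2, 3, 20000
mean = sum(waiting_time(m, k, rng) for _ in range(trials)) / trials
# theoretical mean: (m**k - 1) / (m - 1) = 7 for m = 2, k = 3
```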

Using Sage to compute the coefficients of this generating function for $m=10$ and various small $k$, I was able to find the exact (smallest) $N_1(m, k)$ for which $\Pr(X \le N_1) \ge \frac12$, and $N_2(m, k)$ for which $\Pr(X \le N_2) \ge \frac9{10}$ (i.e. by which time we can be "quite" sure of having seen $k$ consecutive digits):

  • $2$ consecutive digits: half at $N_1 = 8$, "quite" at $N_2 = 23$.
  • $3$ consecutive digits: $N_1 = 78$ and $N_2 = 252$.
  • $4$ consecutive digits: $N_1 = 771$ and $N_2 = 2554$.
  • $5$ consecutive digits: $N_1 = 7703$ and $N_2 = 25578$.
  • $6$ consecutive digits: $N_1 = 77018$ and $N_2 = 255835$.
  • [$7$ consecutive digits: hit limitations of my computer (or programming skills).]
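The table above can be reproduced without Sage; here is a sketch in Python/SymPy (my reconstruction, not the original code) that expands $P(z)$ as a power series and accumulates the coefficients exactly:

```python
import sympy as sp

def quantiles_from_pgf(m, k, nmax):
    """Return the smallest N with Pr(X <= N) >= 1/2 and >= 9/10,
    by expanding P(z) to order nmax (None if not reached)."""
    z = sp.symbols('z')
    P = (m - z) * z**k / (m**k * (1 - z) + (m - 1) * z**k)
    # ascending coefficients p_0, p_1, ..., p_nmax (exact rationals)
    poly = sp.Poly(sp.series(P, z, 0, nmax + 1).removeO(), z)
    coeffs = poly.all_coeffs()[::-1]
    cum, n_half, n_quite = sp.Integer(0), None, None
    for n, c in enumerate(coeffs):
        cum += c
        if n_half is None and cum >= sp.Rational(1, 2):
            n_half = n
        if n_quite is None and cum >= sp.Rational(9, 10):
            n_quite = n
    return n_half, n_quite
```

For example, `quantiles_from_pgf(10, 2, 30)` gives $(8, 23)$, matching the first row of the table.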

There is clearly a pattern there, and I'd like to be able to calculate (either exactly or approximately) the value of $N_1(m, k)$ and $N_2(m, k)$ for larger values. Is there a technique that would give the (possibly asymptotic) values of $N_2(m, k)$ say?

There are 3 answers below.

Answer 1 (9 votes)

Since the generating function is rational, we can say a few things. First, the sequence of probabilities satisfies a linear recurrence. Specifically, $$m^k p_n - m^k p_{n-1} + (m-1)p_{n-k} = 0,$$ which we may prefer to write as $$p_n = p_{n-1} - \frac{m-1}{m^k} p_{n-k}.$$ The linear recurrence is not memory intensive, but I don't think it does significantly better than Sage. My machine finds (via a short Python script) 770164 and 2558418 for "half" and "quite" for $m=10$ and $k = 7$ in about 8 seconds.

By Theorem 4.1.1 of Stanley's Enumerative Combinatorics (Volume 1), if the denominator of your generating function has no repeated roots, we expect $p_n$ to be asymptotic to $A\cdot \left(\frac{1}{\alpha}\right)^n$, where $\alpha$ has the smallest modulus among the roots (for some constant $A$). The sum of the first $n$ probabilities should be approximately $1 - C\cdot\left(\frac{1}{\alpha}\right)^n$ for some $C$. Upon finding $\alpha$ and $C$, these approximations will be much faster to compute than exact values.

It appears to be difficult to find a nice closed-form expression for $\alpha$; however, $$\alpha = 1 + \frac{m-1}{m^k - k(m-1)}$$ is a reasonably good approximation.

Upon finding $C$ and $\alpha$, we have approximations for $N_1(m,k)$ and $N_2(m,k)$ by solving $1 - C(\frac{1}{\alpha})^n = p$. The general solution is $$N_p(m,k) \approx \frac{\ln C -\ln(1-p)}{\ln(\alpha)}.$$For convenience, I used $C = 1$ to check against your values. This gave approximations of $N_1 = 7, 76, 768, 7699, 77013$ and $N_2 = 23,251,2551,25574,255831$ when $m = 10$ and $2 \leq k \leq 6$.
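In code, with $C = 1$ (a sketch; the rounding convention here is round-up, and the results can differ by a unit or two from the figures quoted above for the larger $k$):

```python
import math

def approx_quantile(m, k, p):
    """Approximate N_p(m,k) by solving 1 - (1/alpha)**n = p, with
    alpha = 1 + (m-1)/(m**k - k*(m-1)) and C = 1, rounding up."""
    alpha = 1 + (m - 1) / (m**k - k * (m - 1))
    return math.ceil(-math.log(1 - p) / math.log(alpha))
```

For $m=10$, $k=2$ this gives $N_1 \approx 7$ and $N_2 \approx 23$, and for $k=6$ it lands within a handful of steps of the exact $77018$ and $255835$.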

Finally, observe that transitioning from $k$ to $k + 1$ multiplies the $N$'s by approximately $m$. The binomial approximation suggests why, given $\alpha \approx 1 + \frac{m-1}{m^k}$: $$\left(1 + \frac{m-1}{m^k}\right)^n \approx 1 + \frac{n(m-1)}{m^k} = 1 + \frac{nm(m-1)}{m\cdot m^k} \approx \left(1 + \frac{m-1}{m^{k+1}}\right)^{nm}$$
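A quick numeric check of this scaling, using the exact $N_1$ values for $m=10$ from the question:

```python
# exact N_1 values for m = 10 and k = 2..6, taken from the question
n1 = [8, 78, 771, 7703, 77018]
# successive ratios N_1(10, k+1) / N_1(10, k)
ratios = [b / a for a, b in zip(n1, n1[1:])]
# the ratios approach m = 10 as k grows
```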

Answer 2 (2 votes)

Here is a quick approximation based on the Newton-Raphson method and the partial fraction decomposition of rational functions.

First, the generating function of the cumulative distribution $\sum \Pr(X\le n)z^n$ is obtained from the generating function of the exact distribution $\sum \Pr(X=n)z^n$ by multiplying by a factor of $1/(1-z)$. So, define $$Q(z)=\frac{P(z)}{1-z}=\frac1{1-z}\cdot\frac{(m-z)z^k}{m^k(1-z)+(m-1)z^k}. $$ For an arbitrary probability $p$, your question is equivalent to solving $[z^n]Q(z)=p$ for $n$. This $Q(z)$ has the partial fraction decomposition $$Q(z)=\frac1{1-z}-\frac{m^k-z^k} {m^k(1-z)+(m-1)z^k}.$$

In general, for a rational function $f/g$ with $\deg f<\deg g$ where $g$ has distinct, simple roots $r_1,\ldots, r_d$, $$\frac fg=\sum_i \frac{\alpha_i}{z-r_i},\text{ where each } \alpha_i=\frac{f(r_i)}{g'(r_i)}.$$ Expanding each term as a geometric series, we have $$ \frac fg=\sum_i\sum_{n}-\frac{\alpha_i}{r_i^{n+1}}z^n.$$ Now if $r_j$ is the smallest of the roots (in absolute value), then the first-order approximation of $[z^n]f/g$ is $-\alpha_j/r_j^{n+1}$.
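A toy illustration of this first-order approximation (my own example): take $f=1$ and $g(z)=(z-2)(z-3)$, so $g'(z)=2z-5$, $\alpha_1=f(2)/g'(2)=-1$, $\alpha_2=f(3)/g'(3)=1$, and the smallest-modulus root is $r_1=2$.

```python
from fractions import Fraction

# f/g = 1/((z-2)(z-3)) = -1/(z-2) + 1/(z-3), so
# [z^n] f/g = 1/2^(n+1) - 1/3^(n+1) exactly.
def coeff_exact(n):
    return Fraction(1, 2**(n + 1)) - Fraction(1, 3**(n + 1))

# first-order approximation: keep only the smallest-modulus root r = 2
def coeff_approx(n):
    return Fraction(1, 2**(n + 1))
```

Already at $n=20$, keeping only the smallest root is accurate to about $0.02\%$.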

There is no exact formula for the roots of the denominator $g(z)=m^k(1-z)+(m-1)z^k$ of $P(z)$. However, computational experimentation suggests that the smallest root is only slightly bigger than $1$. (I suspect it wouldn't take too much to prove this.) Taking $z_0=1$ in the Newton-Raphson method, the first iterate $$z_1=z_0-\frac{g(z_0)}{g'(z_0)}=1+\frac{m-1}{m^k-k(m-1)}\approx 1+\frac{m-1}{m^k}$$ is a close approximation of the smallest root of $g$. Writing $r=1+\frac{m-1}{m^k}$ for this root and noting $g'(r)\approx -m^k$, the first-order approximation above gives $$[z^n]\frac{m^k-z^k}{m^k(1-z)+(m-1)z^k}\approx \frac{m^k-r^k}{m^k\, r^{n+1}}\approx\frac{1}{r^{n+1}},$$ since $m^k-r^k\approx m^k-1\approx m^k$. Then $$[z^n]Q(z)\approx 1-\frac1{\left(1+\frac{m-1}{m^k}\right)^{n+1}}.$$ Finally, setting $[z^n]Q(z)=p$ and solving for $n$ (using $\ln r\approx\frac{m-1}{m^k}$), we have $$n\approx-\ln(1-p)\,\frac{m^k}{m-1}.$$ This approximation agrees nicely with your data.
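The claim that the Newton iterate is close to the true smallest root is easy to check numerically (a sketch using NumPy, not part of the original answer):

```python
import numpy as np

def smallest_root(m, k):
    """Smallest-modulus root of g(z) = m^k (1 - z) + (m-1) z^k."""
    # coefficients of (m-1) z^k - m^k z + m^k, highest degree first
    coeffs = [m - 1] + [0] * (k - 2) + [-m**k, m**k]
    roots = np.roots(coeffs)
    return roots[np.argmin(np.abs(roots))]

def newton_iterate(m, k):
    # one Newton step from z0 = 1: g(1) = m-1, g'(1) = -m^k + k(m-1)
    return 1 + (m - 1) / (m**k - k * (m - 1))
```

For $m=10$, $k=5$ the two agree to many decimal places.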

As a side note, my preference for obtaining the original generating function $P(z)$ would be the Goulden-Jackson cluster method. This method is more general than the one you used, it's easier to learn, and it has a wide range of applications. For sure, it's one of my favorites.

Answer 3 (0 votes)

Here is an approximation based on the Poisson clumping heuristic (valid when $p=m^{-(k-1)}$ is small): $$\tau\overset{d}{\approx} k+\operatorname{Exp}(\lambda)\tag 1$$ with rate $$\lambda=\frac p {E[C]}=\frac {m-1}{m^k},\tag 2$$ where $p=m^{-(k-1)}$ is the probability that a given block of $k$ letters is a "hit" (all $k$ letters identical), and $E[C]$ is the expected clump size of such hit blocks: each hit block is followed by an expected $\frac 1 {m-1}$ further letters of the same kind, giving $E[C]=1+\frac1{m-1}=\frac m{m-1}$. From $(1)$ you can read off the non-extreme quantiles (this agrees with the approximation by @Rus May), and your exact numbers are consistent with the relationship that increasing $k$ by $1$ multiplies the quantiles by $m$.
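From $(1)$, the $p$-quantile is approximately $k-\ln(1-p)/\lambda$. A quick check against the exact values in the question (my own sketch):

```python
import math

def pch_quantile(m, k, p):
    """Approximate p-quantile of tau ~ k + Exp(lambda),
    with lambda = (m-1)/m**k from the clumping heuristic."""
    lam = (m - 1) / m**k
    return k - math.log(1 - p) / lam
```

For $m=10$, $k=6$ this gives about $77022$ and $255849$, against the exact $77018$ and $255835$: a relative error under $10^{-4}$.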