How to model a doubling process with errors

89 Views Asked by At

An original document is duplicated. With probability $p_{err}$ a mistake is made in the copy (we will dub the resulting document the "wrong" document.) Both the original and the copy (right, or wrong) are then duplicated. In total there are $d$ duplication events where all current documents are duplicated. What is the probability distribution of the number of correct documents over the total (fraction correct documents $F_R$) after $d$ duplication events?

Attempted solutions.

1.Recurrence relation. If $R(d)$ is the number of right documents after $d$ duplication events,

$R(d+1)=R(d)+(1-p_{err})R(d) = R(d)(2-p_{err})$, $R[0]=1$

and total number of documents is $2^d$, then the fraction correct $F_R$ after $d$ duplication events is

$F_R=(1-\frac{p_{err}}{2})^d$

However, this is just the mean fraction of correct documents, and it assumes that partial documents can be created (e.g., if $d=1$, and $p_{err}$=.01, there are 2 documents total ($2^1$), of which 1.99 documents are "right", giving $F_R=.995$. Not sure how to get the variance, or the whole distribution.

2.Master equation. Given $R$ correct documents, there is a propensity $R(1-p_{err})$ to make another correct document, and a propensity $R p_{err}$ to make an incorrect document. Then the master equation for the time evolution of (R)ight and (W)rong documents is:

$\frac{\partial p(R,W)}{\partial t}=(R-1)(1-p_{err})p(R-1,W)+R p_{err} p(R,W-1) - R p_{err}(R,W)$

However, this assumes a continuous time process where there are no duplication events per se, since R and W grow one by one (though, the graph of R vs time should grow exponentially). Also, even if I have the solution as a function of time, it's not clear how to draw the probability distribution of R/(R+W) from the solution.

I don't really want the answer, but can someone point me in the right direction? What sort of model / approach should I be taking? This is a simple problem to simulate, but I'm interested in the analytical distribution, if it exists.

EDIT. Working out the probabilistic recurrence from user2566092.

OK I think the recurrence is actually $P(k_{d+1})=\sum_{k_{d}=1}^{k_{d}=2^{d-1}}{k_d\choose{2k_d - k_{d+1}}} p^{2k_d - k_{d+1}}(1-p)^{k_{d+1}-k_d} P(k_d)$

If this is right, then for d=0, the outcomes for $k_{d+1}=[1,2]$ are

$[p,1-p]$

now iterate $d$, $d=1$, where $k_d$ is as above, s.t. $k_{d+1}=[1,2,3,4]$ are

$[p^2, (1 - p) p + (1 - p) p^2, 2 (1 - p)^2 p, (1 - p)^3]$

Now it's just a matter of pattern recognition I think.. any way to get mathematica to figure this out? Working on it..

3

There are 3 best solutions below

4
On BEST ANSWER

As often in questions related to branching processes, writing explicit recursion formulas is easy but solving them to deduce exact distributions for general $n$ is impossible.

In the case at hand, the distribution of the number $X_n$ of correct copies at time $n$ is fully characterized by the initial condition $X_0=1$ and by the stochastic recursion $$ X_{n+1}=X'_n+BX''_n\qquad\text{in distribution,} $$ where $B$ is a Bernoulli random variable such that $P[B=1]=1-p_e$ and $P[B=0]=p_e$ and $X_n'$ and $X_n''$ are i.i.d. copies of $X_n$, both independent of $B$.

Likewise, the proportion $Y_n=2^{-n}X_n$ of correct copies at time $n$ is fully characterized by the initial condition $Y_0=1$ and by the stochastic recursion $$ 2Y_{n+1}=Y'_n+BY''_n\qquad\text{in distribution,} $$ with the same convention. Note that this recursion encodes every moment of $Y_n$, for example $2E[Y_{n+1}]=(1+E[B])\cdot E[Y_n]$ and $E[B]=1-p_e$ hence $$ E[Y_n]=r^n\qquad\text{with}\quad r=1-\tfrac12p_e. $$ Likewise, $4Y_{n+1}^2=(Y'_n)^2+B(Y''_n)^2+2BY'_nY''_n$ in distribution. Integrating this and solving the recursion on $\mathrm{var}(Y_n)$, one deduces that $$ \mathrm{var}(Y_n)=\tfrac12p_er^{n-1}\left(r^n-\left(\tfrac12\right)^n\right). $$ When concerned with large values of $n$ however, one rather uses the general result that $Y_n/E[Y_n]$ converges in distribution to some nonnegative random variable $W$ such that $$ E[W]=1,\qquad 2rW=W'+BW''\quad\text{in distribution.} $$ This entirely determines the distribution of $W$ and one can rely on $Y_n\sim r^nW$ to deduce that $\mathrm{var}(Y_n)\sim r^{2n}\mathrm{var}(W)$. Furthermore, $4r^2W^2=(W')^2+2BW'W''+B(W'')^2$ hence, integrating both sides and using the values $E[W]=1$ and $E[B]=2r-1$, one gets $$ r\cdot\mathrm{var}(W)=1-r=\tfrac12p_e, $$ which confirms the previous result that $\mathrm{var}(Y_n)\sim\frac12p_er^{2n-1}$ when $n\to\infty$.

2
On

Often using recurrences is the right approach, and for a discrete problem like this typically you find a recurrence that relates one discrete state to other "previous" discrete states. In your case, if you want to know the probability that you have $k'$ correct documents after $d$ duplications, note that if you previously had $k$ correct documents after $d-1$ duplications, then you made exactly $2k - k'$ errors when you duplicated the $k$ previous correct documents, and the probability of this occurring is ${k\choose{2k - k'}} p^{2k - k'}(1-p)^{k'-k}$. So if the next duplication leaves you $k'$ correct documents, if you use the coefficient formula above to sum up probability expressions for all possible previous values $k$ of previous correct documents, you will get a recurrence relation for the probability that you can at least evaluate efficiently (e.g. if you use a recursive code function that remembers the probabilities it has already calculated) even if you can't use the recurrence to prove a closed-form formula. In many cases though you can use the recurrence to prove a formula, e.g. by induction if you first guess the formula by looking at it for some initial values of $d$.

2
On

It is a branching process. If the polynomial generating function that records the probability of $i$ true copies after $n$ stages of duplication is $Q_n(x)=\sum p_{i,n} x^i$, the recurrence is $$ Q_n(t) = Q_{n-1}(qt^2 + pt)$$ where $q=1-p$ and $Q_0(t)=t$.

This is a supercritical Galton-Watson process. One true copy produces an average of $(2-p)$ true copies, including itself, in the next generation. If $T_n$ is the number of true copies at stage $n$, $\frac{T_n}{(2-p)^n}$ converges almost surely.

Start here

http://en.wikipedia.org/wiki/Branching_process
http://en.wikipedia.org/wiki/Galton%E2%80%93Watson_process