An original document is duplicated. With probability $p_{err}$ a mistake is made in the copy (we will dub the resulting document a "wrong" document). Both the original and the copy (right or wrong) are then duplicated, and so on: in total there are $d$ duplication events, at each of which every current document is duplicated. What is the probability distribution of the fraction of correct documents $F_R$ (number of correct documents over the total) after $d$ duplication events?
Attempted solutions.
1. Recurrence relation. If $R(d)$ is the number of right documents after $d$ duplication events,
$R(d+1)=R(d)+(1-p_{err})R(d) = R(d)(2-p_{err})$, $R(0)=1$,
and the total number of documents is $2^d$, then the fraction correct $F_R$ after $d$ duplication events is
$F_R=(1-\frac{p_{err}}{2})^d$
However, this is just the mean fraction of correct documents, and it effectively allows fractional documents (e.g., if $d=1$ and $p_{err}=0.01$, there are $2^1=2$ documents in total, of which $1.99$ are "right", giving $F_R=0.995$). I'm not sure how to get the variance, let alone the whole distribution.
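For comparison, the process is easy to simulate. A minimal Monte Carlo sketch in Python (the function name and parameter values below are my own, chosen for illustration) confirms the mean $(1-p_{err}/2)^d$ and shows the spread around it:

```python
import random

def simulate_fraction(d, p_err, rng=random):
    """Simulate d duplication events and return the fraction of right documents.

    At each event every document is copied; a copy of a right document is
    wrong with probability p_err, and copies of wrong documents stay wrong.
    """
    right, wrong = 1, 0
    for _ in range(d):
        # each right document yields one copy, correct with prob. 1 - p_err
        new_right = sum(1 for _ in range(right) if rng.random() > p_err)
        wrong = 2 * wrong + (right - new_right)
        right += new_right
    return right / (right + wrong)

d, p_err, trials = 10, 0.05, 2000
samples = [simulate_fraction(d, p_err) for _ in range(trials)]
mean_fr = sum(samples) / trials
print(mean_fr, (1 - p_err / 2) ** d)  # empirical mean vs (1 - p_err/2)^d
```

The empirical mean agrees with the formula, but individual runs scatter around it, which is exactly the distribution being asked about.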
2. Master equation. Given $R$ correct documents, there is a propensity $R(1-p_{err})$ to make another correct document, and a propensity $R p_{err}$ to make an incorrect document. Then the master equation for the time evolution of (R)ight and (W)rong documents is:
$\frac{\partial p(R,W)}{\partial t}=(R-1)(1-p_{err})\,p(R-1,W)+R\, p_{err}\, p(R,W-1) - R\, p(R,W)$,
where the loss term carries the total propensity out of state $(R,W)$, namely $R(1-p_{err})+R\,p_{err}=R$.
However, this assumes a continuous-time process with no duplication events per se, since $R$ and $W$ grow one by one (though $R$ should still grow exponentially in time). Also, even given a solution as a function of time, it's not clear how to extract the probability distribution of $R/(R+W)$ from it.
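For what it's worth, the continuous-time process defined by these propensities can be simulated exactly with the Gillespie algorithm. A Python sketch (note that, with the propensities exactly as stated above, wrong documents never reproduce; the function and parameter values are illustrative):

```python
import random

def gillespie(p_err, t_max, rng):
    """Gillespie simulation of the birth process in the master equation:
    per right document, rate (1 - p_err) to spawn a right copy and
    rate p_err to spawn a wrong copy. Wrong documents do not reproduce."""
    t, R, W = 0.0, 1, 0
    while True:
        total_rate = R  # R*(1 - p_err) + R*p_err
        t += rng.expovariate(total_rate)  # exponential waiting time
        if t > t_max:
            return R, W
        if rng.random() < 1 - p_err:
            R += 1
        else:
            W += 1

rng = random.Random(1)
R, W = gillespie(0.05, t_max=3.0, rng=rng)
print(R, W, R / (R + W))
```

As the question notes, this model grows one document at a time, so it is not the same process as synchronous duplication rounds; it only matches the exponential growth of $R$ on average.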
I don't really want the answer, but can someone point me in the right direction? What sort of model / approach should I be taking? This is a simple problem to simulate, but I'm interested in the analytical distribution, if it exists.
EDIT. Working out the probabilistic recurrence from user2566092.
OK, I think the recurrence is actually $P(k_{d+1})=\sum_{k_{d}=1}^{2^{d}}{k_d\choose{2k_d - k_{d+1}}} p^{2k_d - k_{d+1}}(1-p)^{k_{d+1}-k_d} P(k_d)$, since of the $k_d$ correct documents, $k_{d+1}-k_d$ are copied correctly and $2k_d-k_{d+1}$ are copied incorrectly.
If this is right, then for $d=0$ the probabilities of the outcomes $k_{1}\in\{1,2\}$ are
$[p,\ 1-p].$
Now iterate to $d=1$: with $k_{1}$ distributed as above, the probabilities of $k_{2}\in\{1,2,3,4\}$ are
$[p^2,\ (1 - p) p + (1 - p) p^2,\ 2 (1 - p)^2 p,\ (1 - p)^3].$
Now it's just a matter of pattern recognition, I think. Is there any way to get Mathematica to figure this out? Working on it.
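In the meantime, the recurrence can be iterated numerically. A Python sketch (rather than Mathematica), using the binomial-step view that each of the $k$ right documents is copied correctly with probability $1-p$:

```python
from math import comb

def next_dist(dist, p):
    """One duplication event: dist maps k (number of right docs) -> probability.
    The new count is k plus a Binomial(k, 1 - p) number of correct copies."""
    new = {}
    for k, prob in dist.items():
        for j in range(k + 1):  # j = number of correctly copied right docs
            w = prob * comb(k, j) * (1 - p) ** j * p ** (k - j)
            new[k + j] = new.get(k + j, 0.0) + w
    return new

p = 0.1
dist = {1: 1.0}  # k_0 = 1: start from one right document
for _ in range(2):
    dist = next_dist(dist, p)
# after d = 2 events, compare with [p^2, (1-p)p + (1-p)p^2, 2(1-p)^2 p, (1-p)^3]
print([dist[k] for k in sorted(dist)])
```

This reproduces the $d=1$ probabilities worked out by hand above, and scales to any $d$ (the support grows like $2^d$, so it is exact but exponential in $d$).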
As is often the case in questions related to branching processes, writing explicit recursion formulas is easy but solving them to deduce exact distributions for general $n$ is impossible.
In the case at hand, the distribution of the number $X_n$ of correct copies at time $n$ is fully characterized by the initial condition $X_0=1$ and by the stochastic recursion $$ X_{n+1}=X'_n+BX''_n\qquad\text{in distribution,} $$ where $B$ is a Bernoulli random variable such that $P[B=1]=1-p_e$ and $P[B=0]=p_e$ and $X_n'$ and $X_n''$ are i.i.d. copies of $X_n$, both independent of $B$.
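This distributional recursion translates directly into an exact computation of the pmf of $X_n$: with probability $p_e$ the law of $X_{n+1}$ is that of $X'_n$, and with probability $1-p_e$ it is the law of the sum of two independent copies, i.e. the self-convolution of the pmf. A Python sketch:

```python
def convolve(a, b):
    """Convolution of two pmfs given as lists indexed by value."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

def pmf_X(n, p_e):
    """pmf of X_n (index = count, support 1..2^n) via the recursion
    X_{n+1} = X'_n + B*X''_n in distribution."""
    pmf = [0.0, 1.0]  # X_0 = 1
    for _ in range(n):
        conv = convolve(pmf, pmf)
        padded = pmf + [0.0] * (len(conv) - len(pmf))
        pmf = [p_e * x + (1 - p_e) * y for x, y in zip(padded, conv)]
    return pmf

p_e = 0.1
pmf = pmf_X(2, p_e)
print(pmf[1:])  # compare with [p^2, (1-p)p + (1-p)p^2, 2(1-p)^2 p, (1-p)^3]
```

For $n=2$ this agrees with the binomial recurrence worked out in the question's EDIT, as it must: the two recursions decompose the same branching tree at the last step and at the root, respectively.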
Likewise, the proportion $Y_n=2^{-n}X_n$ of correct copies at time $n$ is fully characterized by the initial condition $Y_0=1$ and by the stochastic recursion $$ 2Y_{n+1}=Y'_n+BY''_n\qquad\text{in distribution,} $$ with the same convention.

Note that this recursion encodes every moment of $Y_n$, for example $2E[Y_{n+1}]=(1+E[B])\cdot E[Y_n]$ and $E[B]=1-p_e$ hence $$ E[Y_n]=r^n\qquad\text{with}\quad r=1-\tfrac12p_e. $$ Likewise, $4Y_{n+1}^2=(Y'_n)^2+B(Y''_n)^2+2BY'_nY''_n$ in distribution. Integrating this and solving the recursion on $\mathrm{var}(Y_n)$, one deduces that $$ \mathrm{var}(Y_n)=\tfrac12p_er^{n-1}\left(r^n-\left(\tfrac12\right)^n\right). $$

When concerned with large values of $n$ however, one rather uses the general result that $Y_n/E[Y_n]$ converges in distribution to some nonnegative random variable $W$ such that $$ E[W]=1,\qquad 2rW=W'+BW''\quad\text{in distribution.} $$ This entirely determines the distribution of $W$ and one can rely on $Y_n\sim r^nW$ to deduce that $\mathrm{var}(Y_n)\sim r^{2n}\mathrm{var}(W)$.

Furthermore, $4r^2W^2=(W')^2+2BW'W''+B(W'')^2$ hence, integrating both sides and using the values $E[W]=1$ and $E[B]=2r-1$, one gets $$ r\cdot\mathrm{var}(W)=1-r=\tfrac12p_e, $$ which confirms the previous result that $\mathrm{var}(Y_n)\sim\frac12p_er^{2n-1}$ when $n\to\infty$.
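As a sanity check, the moment recursions above can be iterated numerically and compared against the closed form for $\mathrm{var}(Y_n)$. A Python sketch:

```python
# Iterate E[Y_n] and E[Y_n^2] from the distributional recursion
#   2 E[Y_{n+1}]   = (1 + E[B]) E[Y_n]
#   4 E[Y_{n+1}^2] = (1 + E[B]) E[Y_n^2] + 2 E[B] E[Y_n]^2
# with E[B] = 1 - p_e, and compare var(Y_n) with
#   (1/2) p_e r^{n-1} (r^n - (1/2)^n),  r = 1 - p_e/2.
p_e = 0.1
r = 1 - p_e / 2
mean, m2 = 1.0, 1.0  # E[Y_0], E[Y_0^2]
for n in range(1, 11):
    m2 = ((1 + (1 - p_e)) * m2 + 2 * (1 - p_e) * mean ** 2) / 4
    mean *= r
    var = m2 - mean ** 2
    closed = 0.5 * p_e * r ** (n - 1) * (r ** n - 0.5 ** n)
    print(n, var, closed)  # the two columns agree
```

The iterated variance matches the closed form to machine precision, and for large $n$ both approach $\tfrac12 p_e r^{2n-1}$ as claimed.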