Help with conditional expectation of a convolution of exponential random variables


I'm working through this paper, with lots of help from all the great people on this site; my statistics/probability background is clearly lacking for some of the mathematical steps. Currently, I'm trying to figure out how equation $(12)$ was derived from equation $(8)$:

Given the pdf $$f_n(t) = \frac{\binom{n+1}{2}}{2N}\exp\left(-\frac{\binom{n+1}{2}}{2N} t\right)\;\;\; (5) $$ the convolution of $f_{n-1}(t),f_{n-2}(t),\ldots,f_m(t)$ has expectation and variance $$E(t) = 2N\sum_{i=m+1}^{n}{\frac{1}{\binom{i}{2}}} = 2N \sum_{i=m+1}^{n}{\left(\frac{2}{i-1} - \frac{2}{i}\right)} = 4N\left(\frac{1}{m}-\frac{1}{n}\right) \;\;\; (8)\\ V(t) = 4N^2\sum_{i=m+1}^{n}{\frac{1}{\binom{i}{2}^2}} \;\;\; (9) $$ Given that the starting frequency of a mutant is $\frac{k}{2N}$, show that the conditional expectation $E\left(t\mid\frac{k}{2N}\right)$ (the mean time for $k$ alleles to go to $2N$ alleles, i.e. the fixation time, since there are $2N$ alleles in the population) is $$ E\left(t\mid\frac{k}{2N}\right)=4N\left(1-\frac{1}{2N} - \sum_{i=1}^{k-1}{\frac{1}{i(i+1)}}\prod_{j=1}^{i}\frac{k-j}{2N-j} \right) \;\;\; (12) $$
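These formulas are easy to sanity-check numerically. Below is a minimal sketch (my own code, not from the paper) that evaluates the telescoping identity in $(8)$ and two boundary cases of $(12)$: $k=1$ should reproduce $(8)$ with $m=1$, $n=2N$, and $k=2N$ should give a fixation time of $0$.

```python
from math import comb

def expected_time(n, m, N):
    """Eq. (8): expected time for n lineages to coalesce down to m."""
    return 2 * N * sum(1 / comb(i, 2) for i in range(m + 1, n + 1))

def conditional_fixation_time(k, N):
    """Eq. (12): mean fixation time given starting frequency k/(2N)."""
    s, pr = 0.0, 1.0
    for i in range(1, k):
        pr *= (k - i) / (2 * N - i)   # running product over j = 1..i
        s += pr / (i * (i + 1))
    return 4 * N * (1 - 1 / (2 * N) - s)

# Telescoping check: 2N * sum 1/C(i,2), i = m+1..n, equals 4N(1/m - 1/n).
print(expected_time(10, 2, 50))           # ~ 4*50*(1/2 - 1/10) = 80
print(conditional_fixation_time(1, 50))   # matches expected_time(100, 1, 50)
print(conditional_fixation_time(100, 50)) # ~0: already fixed, no wait
```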

So basically the exponential distribution listed above (equation $(5)$) is the density of the time it takes $n+1$ randomly sampled alleles to coalesce down to $n$ ancestors. I don't understand why the conditional expectation is different when you start with a known number of alleles; shouldn't you just be able to plug $k$ and $2N$ into equation $(8)$: $$ E(t) = 4N\left(\frac{1}{k}-\frac{1}{2N}\right)\;? $$

I know this is a bit ambiguous, especially without the details of the paper, but I'm sure someone more knowledgeable in statistics/probability can identify what the authors are doing to derive the last equation.

Also, bonus points if you can derive the following from equation $(12)$ as $N\to\infty$: $$E(t\mid p) = -4N(1/p - 1)\ln(1-p)\;,$$ where $p=\frac{k}{2N}$.

Any help would be amazing!

There are 2 answers below.

Best answer

(8) is the expected time to trace back from $n$ copies of an allele to $m$ ancestral copies, but in (12) we don't know how many of the initial $k$ copies are going to have descendants in the population at the time of fixation -- some will likely have lineages that die out before. There has to be at least one ancestor, but the number of additional ancestors $I$ could be anything in $\{0,\ldots, k-1\}$. To calculate $E(t)$, we'll take the total time to go from $1$ ancestor to fixation, and then subtract off the time to go from $1$ ancestor to $I+1$ ancestors (given by (8), although (6) will be more convenient), weighted by the probability that $I=i$: $$ E(t) = 4N\left(1-\frac{1}{2N}-\sum_{i=1}^{k-1}\frac{P(I\ge i)}{i(i+1)}\right). $$
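This decomposition can be made concrete with a small numerical sketch (my own illustration, assuming the product form of $P(I\ge i)$ derived below): take the total time from $1$ ancestor to fixation, then subtract the tail-sum $4N\sum_i P(I\ge i)/(i(i+1))$ for the time from $1$ to $I+1$ ancestors.

```python
from math import prod

def fixation_time_decomposed(k, N):
    """E(t) = [time from 1 ancestor to fixation] - E[time from 1 to I+1]."""
    total = 4 * N * (1 - 1 / (2 * N))  # eq. (8) with m = 1, n = 2N
    for i in range(1, k):
        # P(I >= i): each of i extra ancestors traces to another initial copy.
        p_ge_i = prod((k - j) / (2 * N - j) for j in range(1, i + 1))
        # Going from i to i+1 ancestors costs 4N/(i(i+1)) when I >= i.
        total -= 4 * N * p_ge_i / (i * (i + 1))
    return total
```

For $k=1$ this is just $4N\left(1-\frac{1}{2N}\right)$, and for $k=2N$ it collapses to $0$, as it should.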

To find $P(I\ge i)$, we can now go into the future, past the time when the allele fixes all the way to when the population is descended from a single one of the initial $k$ copies. (This time exists with probability $1$, and will be the same as the fixation time iff $I=0$.) Now go backwards in time to the last point where there was an individual who was not descended from this ultimate winner. With probability $\frac{k-1}{2N-1}$, this individual descended from one of the other $k-1$ copies, and $I \ge 1$; otherwise, $I=0$. Keep going back in time, and with probability $\frac{k-2}{2N-2}$ the next individual with a new ancestor will descend from one of the remaining $k-2$ copies, etc. So the distribution of $I$ is $$ P(I\ge i) = \prod_{j=1}^i \frac{k-j}{2N-j}, $$ and (12) follows.
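The backward-in-time argument lends itself to simulation. The sketch below (my own code; `sample_I` is a hypothetical helper implementing the sequential step described above) draws $I$ by flipping a $\frac{k-j}{2N-j}$ coin at each step $j$ and compares the empirical survival probabilities to the product formula.

```python
import random

def survival(k, N, i):
    """P(I >= i) = prod_{j=1}^{i} (k-j)/(2N-j)."""
    p = 1.0
    for j in range(1, i + 1):
        p *= (k - j) / (2 * N - j)
    return p

def sample_I(k, N, rng):
    """Walk back in time: at step j = i+1, the next 'new ancestor' comes
    from one of the other remaining copies with prob (k-j)/(2N-j)."""
    i = 0
    while i < k - 1 and rng.random() < (k - i - 1) / (2 * N - i - 1):
        i += 1
    return i

rng = random.Random(0)
k, N, trials = 5, 20, 50_000
draws = [sample_I(k, N, rng) for _ in range(trials)]
empirical = [sum(d >= i for d in draws) / trials for i in range(k)]
```

With $k=5$, $N=20$ the empirical survival curve agrees with $\prod_{j=1}^{i}\frac{k-j}{2N-j}$ to within Monte Carlo noise.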

Second answer

I can't work out the meaning of $k$ and $m$ in your equations, so I can't address how equation (12) was derived. However, I can give a proof of the asymptotic equation; I hope it helps.
For fixed $p$ $(0<p<1)$, $k$ grows to infinity as $N\to\infty$. For simplicity we bound the tail of the sum first. Since $\frac{k-j}{2N-j}\le\frac{k}{2N}=p$ and $i(i+1)\ge i^2\ge\sqrt{k}$ for $i\ge\sqrt[4]{k}$, $$\begin{align} \sum_{i=\sqrt[4]{k}}^{k-1}\frac{1}{i(i+1)}\prod_{j=1}^{i}\frac{k-j}{2N-j} & \leq\sum_{i=\sqrt[4]{k}}^{k-1}\frac{1}{i(i+1)}p^i\\ & \leq\sum_{i=\sqrt[4]{k}}^{\infty}\frac{1}{\sqrt{k}}p^i\\ & \leq\frac{1}{\sqrt{k}(1-p)} \end{align}$$ which tends to $0$ as $k\to\infty$.
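A quick numerical check of this tail bound (my own sketch, using a running product so it stays $O(k)$): for $p=1/2$, the tail of the sum beyond $i\ge\sqrt[4]{k}$ is indeed far below $\frac{1}{\sqrt{k}(1-p)}$.

```python
def tail_sum(k, N, i0):
    """Sum of prod_{j<=i}(k-j)/(2N-j) / (i(i+1)) over i = i0 .. k-1."""
    total, pr = 0.0, 1.0
    for i in range(1, k):
        pr *= (k - i) / (2 * N - i)  # running product up to j = i
        if i >= i0:
            total += pr / (i * (i + 1))
    return total

k, N = 1000, 1000            # so p = k/(2N) = 0.5
i0 = round(k ** 0.25)        # tail starts at i ~ k^(1/4)
bound = 1 / (k ** 0.5 * (1 - 0.5))
print(tail_sum(k, N, i0), "<=", bound)
```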
Now we find the formula for $E(t|p)$. Note that $$\begin{align} \sum_{i=1}^{\sqrt[4]{k}-1}\frac{1}{i(i+1)}\prod_{j=1}^{i}\frac{k-j}{2N-j} & =\sum_{i=1}^{\sqrt[4]{k}-1}\frac{1}{i(i+1)}\prod_{j=1}^{i}\frac{p-j/(2N)}{1-j/(2N)}\\ & =\sum_{i=1}^{\sqrt[4]{k}-1}\frac{1}{i(i+1)}\prod_{j=1}^{i}(p+O(j/N))\\ & =\sum_{i=1}^{\sqrt[4]{k}-1}\frac{1}{i(i+1)}\prod_{j=1}^{i}(p+O(k^{1/4}/N))\\ & =\sum_{i=1}^{\sqrt[4]{k}-1}\frac{1}{i(i+1)}(p^{i}+O(ik^{1/4}/N))\\ & =\left(\sum_{i=1}^{\sqrt[4]{k}-1}\frac{1}{i(i+1)}p^{i}\right)+O(k^{3/4}/N) \end{align}$$ Since $k/2N=p$ is fixed, $O(k^{3/4}/N)=O(N^{-1/4})\to 0$ as $N\to\infty$.
Now the whole sum evaluates to $$\begin{align} \lim_{N\to\infty}\sum_{i=1}^{k-1}\frac{1}{i(i+1)}\prod_{j=1}^{i}\frac{k-j}{2N-j} & =\lim_{N\to\infty}\left(\sum_{i=1}^{\sqrt[4]{k}-1}\frac{1}{i(i+1)}p^{i}\right)\\ & = \sum_{i=1}^{\infty}\frac{p^i}{i(i+1)}\\ & = \sum_{i=1}^{\infty}\frac{p^i}{i}-\sum_{i=1}^{\infty}\frac{p^i}{i+1}\\ & = \sum_{i=1}^{\infty}\frac{p^i}{i}-\frac{1}{p}\sum_{i=2}^{\infty}\frac{p^{i}}{i} \end{align}$$ From the Taylor series $\ln(1-x)=-\sum_{i=1}^{\infty}x^i/i$ we get $$\sum_{i=1}^{\infty}\frac{p^i}{i(i+1)} = -\ln(1-p)+\frac{1}{p}\bigl(\ln(1-p)+p\bigr) = 1+\left(\frac{1}{p}-1\right)\ln(1-p),$$ and therefore $$E(t\mid p) \to 4N\left(1 - 1 - \left(\tfrac{1}{p}-1\right)\ln(1-p)\right) = -4N\left(\tfrac{1}{p}-1\right)\ln(1-p),$$ as claimed.
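Putting the limit together numerically (my own sketch): for large $N$ with $p=k/(2N)$ fixed, the full sum approaches the closed form $1+(1/p-1)\ln(1-p)$ that the series for $\ln(1-x)$ yields, which is exactly what makes $E(t\mid p)\approx -4N(1/p-1)\ln(1-p)$.

```python
from math import log

def full_sum(k, N):
    """sum_{i=1}^{k-1} prod_{j<=i}(k-j)/(2N-j) / (i(i+1))."""
    total, pr = 0.0, 1.0
    for i in range(1, k):
        pr *= (k - i) / (2 * N - i)
        total += pr / (i * (i + 1))
    return total

def limit_sum(p):
    """sum_{i>=1} p^i/(i(i+1)) = 1 + (1/p - 1)*ln(1-p)."""
    return 1 + (1 / p - 1) * log(1 - p)

p = 0.5
N = 100_000
k = int(2 * N * p)
# full_sum(k, N) -> limit_sum(p) as N grows, hence
# E(t|p) = 4N(1 - 1/(2N) - full_sum(k, N)) -> -4N(1/p - 1)*ln(1-p).
print(full_sum(k, N), limit_sum(p))
```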