Distribution of the first repeated value in an i.i.d. sequence

125 Views Asked by At

Here is a fun problem posed by a friend.

Consider a random variable $Y_t$ taking values in a discrete set $S$ with a probability density $\pi(y)$ for all times $t$ $\in$ $\{1, \dots \}$, such that $Y_t$ is $\mathrm{iid}$ for all $t$. Define a new random variable $Z$ to be the first value of $Y_t$ that repeats. That is, $Z = Y_{\min\{t \mid Y_t \in \{Y_1, \dots, Y_{t-1}\}\}}$. What is the density $\mathbb{P}(Z = z)$?

I think I have an answer, but would like to see what others come up with!

1

There are 1 best solutions below

1
On

Consider the case of finite support $S$. Without loss of generality, take $S=\{1,2,3,\ldots,n\},$ and define $Z_n := Y_{\min\{t \mid Y_t \in \{Y_1, \dots, Y_{t-1}\}\}}$, with $p_k:=\mathbb{P}(Y_i=k)$ for each $k\in S.$

We find $\mathbb{P}(Z_n=k)$ as a function of $(p_1,\ldots,p_n)$, first considering some small cases, where we let $Y$ be the string $(Y_1Y_2\ldots Z_n)$, $Z_n$ being the first of the $n$ possible values to be repeated:

$n=2:$ $$\begin{align} \mathbb{P}(Z_2=1)&=P(Y \in \{11, 211, 121\})=p_1^2 + 2p_2p_1^2=p_1^2(1+2p_2)\\ \mathbb{P}(Z_2=2)&=P(Y \in \{22, 122, 212\})=p_2^2 + 2p_1p_2^2=p_2^2(1+2p_1) \end{align}$$

$n=3:$ $$\begin{align} \mathbb{P}(Z_3=1)&=P(Y \in \{11, 211,121, 311,131, 2311,2131,1231,3211,3121,1321\})\\ &=p_1^2(1+2(p_2+p_3)+6p_2p_3)\\ \mathbb{P}(Z_3=2)&=P(Y \in \{22, 122,212, 322,232, 1322,1232,2132,3122,3212,2312\})\\ &=p_2^2(1+2(p_1+p_3)+6p_1p_3)\\ \mathbb{P}(Z_3=3)&=P(Y \in \{33, 233,323, 133,313, 2133,2313,3213,1233,1323,3123\})\\ &=p_3^2(1+2(p_2+p_1)+6p_2p_1)\\ \end{align}$$

$n=4:$ $$\begin{align} \mathbb{P}(Z_4=1)&=p_1^2(1+2(p_2+p_3+p_4)+6(p_2p_3+p_2p_4+p_3p_4)+24p_2p_3p_4)\\ \mathbb{P}(Z_4=2)&=p_2^2(1+2(p_1+p_3+p_4)+6(p_1p_3+p_1p_4+p_3p_4)+24p_1p_3p_4)\\ \mathbb{P}(Z_4=3)&=p_3^2(1+2(p_2+p_1+p_4)+6(p_2p_1+p_2p_4+p_1p_4)+24p_2p_1p_4)\\ \mathbb{P}(Z_4=4)&=p_4^2(1+2(p_2+p_3+p_1)+6(p_2p_3+p_2p_1+p_3p_1)+24p_2p_3p_1)\\ \end{align}$$

NB: It is readily verified that, for each case of $n$, the probabilities $\mathbb{P}(Z_n=1),\ldots,\mathbb{P}(Z_n=n)$ do indeed lie in the interval $[0,1]$ and sum to $1$.

Thus, defining

$$\begin{align}f(p_1,...,p_n):= p_1^2\left(1!+2!\sum_\limits{1<i_1\le n}p_{i_1} + 3!\sum_\limits{1<i_1<i_2\le n}p_{i_1}p_{i_2}+4!\sum_\limits{1<i_1<i_2<i_3\le n}p_{i_1}p_{i_2}p_{i_3} +\ldots+ n!\sum_\limits{1<i_1<\cdots<i_{n-1}\le n}p_{i_1}\cdots p_{i_{n-1}}\right)\end{align}$$ we see that for each $k\in S=\{1,2,3,\ldots,n\}$, $$\begin{align}\mathbb{P}(Z_n=k)=f(p_{j_1},...,p_{j_n}) \end{align}$$ where $(j_1,\ldots,j_n)$ is just $(1,2,\ldots,n)$ with $1$ and $k$ transposed.

That is, the distribution of $Z_n$ is given by the following $n$ probabilities: $$\begin{align} \mathbb{P}(Z_n=1)&=f(p_1,p_2,p_3,p_4,\ldots,p_n)\\ \mathbb{P}(Z_n=2)&=f(p_2,p_1,p_3,p_4,\ldots,p_n)\\ \mathbb{P}(Z_n=3)&=f(p_3,p_2,p_1,p_4,\ldots,p_n)\\ &\ldots\\ \mathbb{P}(Z_n=n)&=f(p_n,p_2,p_3,p_4,\ldots,p_1). \end{align}$$

NB: In the above formula for $f(p_1,...,p_n)$, the $k$th term (with $k=1,...,n$) is $$k!\sum_\limits{1<i_1<\cdots<i_{k-1}\le n}p_{i_1}\cdots p_{i_{k-1}}, $$ in which the summation has $\binom{n-1}{k-1}$ summands, each being a product of $k-1$ probabilities corresponding to choosing $k-1$ of the $n-1$ indices $2,\ldots,n.$


Example: $Y_i\sim \text{Uniform}\{1,\ldots,n\}.$ In this case, $p_k={1\over n}$ for $k=1,\ldots,n,$ hence $$\begin{align}f(p_1,...,p_n) &={1\over n^2}\left(1!+2!\binom{n-1}{1}{1\over n} + 3!\binom{n-1}{2}{1\over n^2}+\ldots+ n!\binom{n-1}{n-1}{1\over n^{n-1}}\right)\\ &={1\over n^2}\sum_\limits{k=1}^n k \prod_\limits{j=1}^{k-1}\left(1-{j\over n}\right)={1\over n^2}\,n\\ &={1\over n} \end{align}$$

where we have used Wolfram Alpha to evaluate the indicated sum-of-products.

Hence, in this case the distribution of $Z_n$ is also $\text{Uniform}\{1,\ldots,n\}$ (as intuitively expected).


To do: Find an expression for $\mathbb{P}(Z_\infty=k)$, i.e. for the case of countably infinite support.