Analysis of a stochastic process for shuffling an ordered list


Recently I needed a way to shuffle an ordered list with a controllable degree of randomness. To state it more formally: given an ordered list $\mathcal{X}=(x_1,x_2,\dots,x_n)$ and a number $r\in[0,1]$, the goal is to produce another ordered list $\mathcal{Y}=(y_1,y_2,\dots,y_n)$, with each $x_i$ appearing exactly once in $\mathcal{Y}$, such that the expected correlation$^*$ between $\mathcal{X}$ and $\mathcal{Y}$ is $r$. In particular, note that if $r=1$, then we should have $\mathcal{Y}=\mathcal{X}$, and if $r=0$, then $\mathcal{Y}$ should be a completely random shuffle of $\mathcal{X}$.

One possible solution is to produce each of the possible $n!$ reorderings of $\mathcal{X}$ and pick the one that gives a correlation closest to $r$. This obviously scales badly in terms of computational complexity, so I would be happy with a stochastic algorithm that only gives a correlation of $r$ on average.

After some thinking, I came up with the following idea for such an algorithm:

Set $\mathcal{X}'=\mathcal{X}$ and go fetch your favorite biased coin with $\Pr(T)=p$. Then do the following to set $y_1$ to some $x_i$. Start at the first element of $\mathcal{X}'$ and toss; if a $T$ comes up, skip this element, move on to the next element, and toss again. Keep skipping elements as long as a $T$ comes up; if you run out of elements cycle back to the first one and keep going. As soon as a $H$ comes up, set $y_1$ to the current element and remove that element from $\mathcal{X}'$. Then choose $y_2$ following the same procedure, and so on until you have chosen $y_n$.
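For concreteness, the procedure can be sketched in Python (a minimal sketch; `skip_shuffle` is an illustrative name, and the sketch assumes $p<1$ so the inner loop terminates):

```python
import random

def skip_shuffle(xs, p, rng=random):
    # Pick each output element with a biased coin: tails (probability p)
    # skips to the next remaining element, cycling back to the start of
    # the pool; heads selects the current element. Assumes p < 1.
    pool = list(xs)
    out = []
    while pool:
        idx = 0  # each round restarts at the first remaining element
        while rng.random() < p:
            idx = (idx + 1) % len(pool)
        out.append(pool.pop(idx))
    return out
```

With $p=0$ this returns the list unchanged; as $p\to 1$ the inner loop makes many full cycles through the pool before stopping, washing out the head start of the early elements.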

It's easy to see that this algorithm produces the desired outcomes at the extremes of $p$:

  • If $p=0$ it deterministically sets $y_j=x_j$ for each $j$, and so $\mathbb{E}[\text{Corr}[\mathcal{X},\mathcal{Y}]]=1$.

  • As $p\rightarrow 1$, the expected number of skips becomes much larger than the length of $\mathcal{X}'$ (which is at most $n$), such that the bias introduced by starting at the first element of $\mathcal{X}'$ diminishes to $0$, and each $y_j$ is set to a random $x_i$. Therefore, $\mathbb{E}[\text{Corr}[\mathcal{X},\mathcal{Y}]]\rightarrow 0$.

From these observations, intuition suggests that as $p$ goes from $0$ to $1$, the expected correlation should decay monotonically from $1$ down to $0$. Monte-Carlo simulations confirm this empirically. The remaining burning question, however, is:

Given a desired expected correlation $r$, what value should $p$ be set to?

*Defined as $\text{Corr}[\mathcal{X},\mathcal{Y}] = \text{Cov}[\mathcal{X},\mathcal{Y}]/(\sigma_\mathcal{X} \sigma_\mathcal{Y})$.
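The Monte-Carlo simulations mentioned above can be sketched as follows (a self-contained, illustrative Python sketch; `expected_corr` estimates $\mathbb{E}[\text{Corr}[\mathcal{X},\mathcal{Y}]]$ for $\mathcal{X}=(1,\dots,n)$, assuming $p<1$):

```python
import random

def skip_sample(n, p, rng):
    # one draw of the biased-coin shuffle applied to (1, ..., n); assumes p < 1
    pool = list(range(1, n + 1))
    out = []
    while pool:
        i = 0
        while rng.random() < p:
            i = (i + 1) % len(pool)
        out.append(pool.pop(i))
    return out

def pearson(xs, ys):
    # Corr[X, Y] = Cov[X, Y] / (sigma_X * sigma_Y)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def expected_corr(n, p, trials=2000, seed=0):
    rng = random.Random(seed)
    xs = list(range(1, n + 1))
    return sum(pearson(xs, skip_sample(n, p, rng)) for _ in range(trials)) / trials
```

Sweeping $p$ over a grid and plotting `expected_corr` shows the monotone decay from $1$ toward $0$.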


Below I explain what I've tried and where I got stuck.

I attempted to obtain the whole probability distribution, i.e. compute $\Pr(y_j=x_i)$ for each pair $(i,j)$. This is easy for $y_1$, because $\mathcal{X}'=\mathcal{X}$. To assign $y_1$ to $x_i$ the first time round, we must get $(i-1)$ skips followed by one selection. This happens with probability $p^{i-1}(1-p)$. If we don't get a selection on $x_i$ the first time round, we could get it the second time round, after we've performed an additional $n$ skips. This happens with probability $p^n p^{i-1} (1-p)$. Or we could get it on the third attempt, or the fourth, and so on. Summing up:

$$ \Pr(y_1=x_i) = p^{i-1}(1-p)\sum\limits_{k=0}^{\infty} p^{kn} = \frac{1-p}{1-p^n}p^{i-1} \quad i\in\{1,2,\dots,n\} $$

For convenience let's define: $$ F(n,i)=\frac{1-p}{1-p^n}p^{i-1} $$

so we have $\Pr(y_1=x_i) = F(n,i)$.
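As a sanity check, the closed form can be compared against simulated first picks (an illustrative Python sketch; `F`, `first_pick`, and `empirical` are hypothetical names):

```python
import random

def F(n, i, p):
    # closed-form Pr(y_1 = x_i) derived above
    return (1 - p) / (1 - p ** n) * p ** (i - 1)

def first_pick(n, p, rng):
    # 1-based index selected by the biased-coin procedure on a pool of size n
    i = 0
    while rng.random() < p:
        i = (i + 1) % n
    return i + 1

def empirical(n, p, trials=100_000, seed=0):
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(trials):
        counts[first_pick(n, p, rng)] += 1
    return [c / trials for c in counts[1:]]
```

The probabilities sum to $1$ (a geometric series), and the empirical frequencies match `F` to within sampling error.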

Great, now on to computing $\Pr(y_2=x_i)$. This is clearly conditioned on what happened when choosing $y_1$; say $y_1=x_{\alpha_1}$. Note that now $|\mathcal{X}'|=n-1$. We have three possible cases:

  • If $\alpha_1 < i$, then $x_i$ is now the $(i-1)^\text{th}$ element in $\mathcal{X}'$, so $\Pr(y_2=x_i | \alpha_1<i) = F(n-1,i-1)$.
  • If $\alpha_1=i$ then trivially $\Pr(y_2=x_i | \alpha_1=i) = 0$.
  • If $\alpha_1>i$, then $x_i$ is still the $i^\text{th}$ element in $\mathcal{X}'$ so $\Pr(y_2=x_i | \alpha_1>i) = F(n-1,i)$.

Since we have exhausted all the cases, we can now marginalize over $\alpha_1$ to obtain the unconditional probability $\Pr(y_2=x_i)$. Note that we can obtain $\Pr(\alpha_1<i)$, $\Pr(\alpha_1=i)$, and $\Pr(\alpha_1>i)$ from the distribution of $y_1$, although we have to compute the relevant finite sums for the inequality cases. Tedious, but easy enough.
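The three-case marginalization can be written out explicitly (a self-contained Python sketch; `pr_y2` is an illustrative name, valid for $0<p<1$):

```python
def F(n, i, p):
    # closed-form Pr(y_1 = x_i)
    return (1 - p) / (1 - p ** n) * p ** (i - 1)

def pr_y2(n, i, p):
    # marginalize over the first pick alpha_1 using the three cases above
    p_less = sum(F(n, a, p) for a in range(1, i))          # Pr(alpha_1 < i)
    p_more = sum(F(n, a, p) for a in range(i + 1, n + 1))  # Pr(alpha_1 > i)
    total = 0.0
    if p_less > 0:  # guard: avoids evaluating F(n-1, 0, p) at the edge i = 1
        total += p_less * F(n - 1, i - 1, p)
    if p_more > 0:
        total += p_more * F(n - 1, i, p)
    return total
```

Summing `pr_y2` over $i$ gives $1$, as it must for a probability distribution.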

However, hopefully you can now see the problem we're going to run into when computing the distributions for larger $j$. $\Pr(y_j=x_i)$ is conditioned on what happened when choosing all of $y_1,y_2,\dots,y_{j-1}$, and the number of cases to exhaust blows up very quickly. Even worse, the number of possible cases depends on $i$.

Is this simply intractable, or is there some clever way around it, or at least a reasonable approximation? Or perhaps some completely different way of looking at the problem?

Answer:

Since you said you would welcome approaches different from yours, here is one algorithm for generating correlated sequences (where we measure the correlation by a function of the normalized Hamming distance between the two sequences):

---

Algorithm:

For every index $i \in [n]$, toss a coin $C_i$ with bias $p$ (i.e. $\Pr(C_i = H) = p$); if heads, add $i$ to a set $S$; if tails, let $y_i = x_i$.

Then let $\sigma$ be a uniformly random permutation of the indices in $S$, and for each $i \in S$ let $y_i = x_{\sigma(i)}$.

---
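A minimal Python sketch of this algorithm (`partial_shuffle` is an illustrative name; heads occur with probability $p$):

```python
import random

def partial_shuffle(xs, p, rng=random):
    # Toss a p-biased coin per position; heads positions enter S and have
    # their values uniformly permuted among themselves, tails keep x_i.
    ys = list(xs)
    heads = [i for i in range(len(xs)) if rng.random() < p]
    vals = [xs[i] for i in heads]
    rng.shuffle(vals)  # uniform random permutation of the selected values
    for i, v in zip(heads, vals):
        ys[i] = v
    return ys
```

At $p=0$ this is the identity; at $p=1$ it is a uniform shuffle of the whole list.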

Now, we just do the usual analysis. For convenience, let $A$ be the random variable denoting the fraction of agreeing coordinates. (I am using this so that it behaves like a notion of normalized correlation: if it is $1$ the strings are identical, and if it is $0$ they agree nowhere.)

$\mathbb{E}[A] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} 1_{x_i = y_i}\right] = \mathbb{E}[1_{x_1 = y_1}]$ by symmetry.

Note that $\mathbb{E}[1_{x_1 = y_1}] = \mathbb{E}[1_{x_1 = y_1} \mid C_1 = T]\Pr(C_1 = T) + \mathbb{E}[1_{x_1 = y_1} \mid C_1 = H]\Pr(C_1 = H)$ by the law of total expectation. Since tails forces $y_1 = x_1$, this equals $(1-p) + p\,\mathbb{E}[1_{x_1 = y_1} \mid C_1 = H]$.

$\mathbb{E}[1_{x_1 = y_1} \mid C_1 = H]$ is a bit more interesting; letting the total number of heads among $\{2,\dots,n\}$ be $N$, we get:

$\mathbb{E}[1_{x_1 = y_1} \mid C_1 = H] = \mathbb{E}[\mathbb{E}[1_{x_1 = y_1} \mid C_1 = H, N] \mid C_1 = H] = \mathbb{E}\frac{1}{1+N},$ since given $C_1 = H$ the index $1$ lands in $S$ along with $N$ other indices, and a uniformly random permutation of $S$ fixes index $1$ with probability $\frac{1}{1+N}$. Now, note that $N$ is binomial with parameters $n-1$ and $p$, so from the known result for $\mathbb{E}\frac{1}{X+1}$ with $X$ binomial (see the question "Find the expected value of $\frac{1}{X+1}$ where $X$ is binomial") we get $\mathbb{E}\frac{1}{1+N} = \frac{1}{np}\left(1-(1-p)^{n}\right)$, so combining all of the above we get:

$\mathbb{E}[A] = (1-p) + \frac{1}{n}\left(1-(1-p)^{n}\right)$. Now, given $r$ and $n$, you can solve for $p$. (I would probably let a computer solve it numerically, and note that the solution will satisfy $r \approx 1-p$ for sufficiently large $n$.)
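Inverting this closed form numerically is straightforward, since $\mathbb{E}[A]$ decreases from $1$ at $p=0$ to $1/n$ at $p=1$; a bisection sketch in Python (`solve_p` is an illustrative name, valid for $1/n \le r \le 1$):

```python
def solve_p(r, n, tol=1e-12):
    # invert E[A] = (1 - p) + (1 - (1 - p)**n) / n  for p by bisection
    f = lambda p: (1 - p) + (1 - (1 - p) ** n) / n
    lo, hi = 0.0, 1.0  # f(lo) = 1 >= r >= 1/n = f(hi), f non-increasing
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For large $n$ the returned value is close to $1-r$, consistent with the remark above.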