Entropy of shuffled sequence split into two groups


Let it be a sequence of $N$ elements that can be ordered. At each draw, all elements from the sequence are shuffled. Example with $N = 8$:

$draw_1 = (e_4, e_3, e_0, e_6, e_1, e_7, e_2, e_5)\\ draw_2 = (e_7, e_2, e_0, e_1, e_5, e_3, e_6, e_4)\\ draw_3 = (e_4, e_0, e_3, e_2, e_6, e_5, e_7, e_1)\\ ...$

Within a sequence, elements can be pairwise compared. Since there are $N!$ different orderings, each draw has an entropy of $\log_2(N!)$ bits.

Now, let's modify the system so that there are two sequences of $N/2$ elements each. At each draw, the elements are shuffled; half of them go to the first sequence and the other half to the second. Example with $N=8$:

$ draw_1 = \{(e_4, e_3, e_7, e_0), (e_2, e_1, e_5, e_6)\}\\ draw_2 = \{(e_6, e_3, e_1, e_7), (e_5, e_0, e_2, e_4)\}\\ draw_3 = \{(e_7, e_1, e_5, e_4), (e_0, e_6, e_2, e_3)\}\\ ... $

What is the Shannon entropy of such a system if elements can only be pairwise compared between the two sequences, not within the same sequence? For example, let's define the ordering as $e_n < e_{n+1}$. In $draw_1$, $e_4$ is compared to $e_2$, $e_1$, $e_5$ and $e_6$; $e_3$ is compared to $e_2$, $e_1$, $e_5$ and $e_6$; etc. We can use a bit sequence to describe the draw, for example $draw_1 = $ 1100110011110000. I am looking for the entropy of such a sequence.
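For concreteness, this encoding can be sketched in Python (the helper name `draw_bits` is mine, not part of the question): bit $(i,j)$ is 1 iff the $i$th element of the first sequence exceeds the $j$th element of the second.

```python
def draw_bits(first_half, second_half):
    # One bit per cross-pair (a, b): "1" iff a > b, concatenated
    # row by row (all comparisons of first_half[0], then first_half[1], ...).
    return "".join(
        "1" if a > b else "0"
        for a in first_half
        for b in second_half
    )

# The question's draw_1 with N = 8:
print(draw_bits([4, 3, 7, 0], [2, 1, 5, 6]))  # -> 1100110011110000
```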

The full context of this question is described in this crypto.stackexchange question.

There are 2 answers below.

First answer:

Note: I made a logical error and computed the entropy assuming all microstates were equally likely. Furthermore, I think the approximations I made are too crude to be useful in the range where the OP is interested. I leave this answer up since it has some useful ideas, even though it is incomplete and inaccurate.


Answer: Your modified draw sequence asymptotically has the same entropy as the regular draw sequence; both are $\sim \log_2 N!\sim N\log_2 N$.

Let $n=N/2$.

Each draw determines an $n\times n$ binary matrix $M$, where $M_{i,j}=1$ if the $i^{th}$ element of the first sequence is greater than the $j^{th}$ element of the second sequence, and $M_{i,j}=0$ otherwise. You want to count the number of binary matrices you can get.

Define a $2\times 2$ submatrix of $M$ to be the result of selecting two rows of $M$ (not necessarily adjacent) and two columns of $M$, and looking at the four entries of $M$ at the intersection of these rows and columns. Note that our matrix $M$ cannot have any $2\times 2$ submatrices which look like either $$ \begin{bmatrix}1&0\\0&1\end{bmatrix} \qquad\text{or}\qquad \begin{bmatrix}0&1\\1&0\end{bmatrix} $$ since this would imply there were entries $a_i,a_j$ in the first sequence and $b_k,b_h$ in the second sequence for which $a_i<b_k<a_j<b_h<a_i$, a contradiction. On the other hand, you can show that every matrix which avoids these submatrices can be constructed from one of your draws. Therefore, counting draws is equivalent to counting matrices avoiding these submatrices.
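Both the admissibility condition and the count for $n=2$ can be checked with a short sketch (the function name `admissible` is mine). Note that for $n=2$ the only inadmissible matrices are the two forbidden patterns themselves, leaving $16-2=14$ admissible matrices:

```python
from itertools import combinations, product

def admissible(M):
    # True iff no 2x2 submatrix of the 0/1 matrix M equals
    # [[1,0],[0,1]] or [[0,1],[1,0]].
    n = len(M)
    for r1, r2 in combinations(range(n), 2):
        for c1, c2 in combinations(range(n), 2):
            sub = (M[r1][c1], M[r1][c2], M[r2][c1], M[r2][c2])
            if sub in ((1, 0, 0, 1), (0, 1, 1, 0)):
                return False
    return True

# The matrix of the question's draw_1: rows (4, 3, 7, 0), columns (2, 1, 5, 6).
M1 = [[int(a > b) for b in (2, 1, 5, 6)] for a in (4, 3, 7, 0)]
print(admissible(M1))                  # -> True
print(admissible([[1, 0], [0, 1]]))    # -> False

# Count admissible 2x2 matrices by exhaustion over all 16 binary matrices.
count = sum(admissible([[b[0], b[1]], [b[2], b[3]]])
            for b in product((0, 1), repeat=4))
print(count)  # -> 14
```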

It turns out that the number of such matrices is $$ a_n:=\sum_{j=0}^n (j!)^2{n+1\brace j+1}^2 $$ where ${m \brace k}$ is the $(m,k)^{th}$ Stirling number of the second kind, equal to the number of ways to partition a set of size $m$ into $k$ disjoint subsets. For a proof, see the computation following Theorem 2.1 in https://arxiv.org/abs/1103.4884.
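This formula can be sketched directly (helper names `stirling2` and `a` are mine; the recurrence ${m\brace k}=k{m-1\brace k}+{m-1\brace k-1}$ is the standard one), reproducing the opening terms $2, 14, 230, 6902, 329462$ of OEIS A048163:

```python
from math import factorial

def stirling2(m, k):
    # Stirling number of the second kind {m brace k}, via the
    # standard recurrence {m,k} = k*{m-1,k} + {m-1,k-1}.
    if m == k:
        return 1
    if k == 0 or k > m:
        return 0
    return k * stirling2(m - 1, k) + stirling2(m - 1, k - 1)

def a(n):
    # a_n = sum_{j=0}^{n} (j!)^2 * {n+1 brace j+1}^2
    return sum(factorial(j) ** 2 * stirling2(n + 1, j + 1) ** 2
               for j in range(n + 1))

print([a(n) for n in range(1, 6)])  # -> [2, 14, 230, 6902, 329462]
```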

The sequence $a_n$ appears in https://oeis.org/A048163, where Vaclav Kotesovec claims $\lim_{n\to\infty}(a_n/n!)^{1/n}/n=C$, with $C=\frac1{e(\log 2)^2}\approx 0.7656928576\dots$ Therefore, $$ \log a_n\sim \log n! +n\log n+n\log C\tag{$*$} $$ Using Stirling's approximation, we get that $$ \log N!\sim N \log N, $$ while $$ \require{cancel} \log a_{N/2}\sim \log (N/2)! + (N/2)\log (N/2)+\cancelto{O(N)}{\color{gray}{(N/2)\log C}}\sim N\log (N/2)\sim N \log N $$ (the greyed term is linear in $N$ and hence negligible next to $N\log N$).

That is, both experiments have an asymptotic entropy of $N\log N$, so your restriction with the ordering only mattering between the two halves of the draw has negligible effect on the entropy. I am only keeping track of the dominant terms, but if necessary you can use $(*)$ to get a more precise estimate of the difference between $\log_2 N!$ and $\log_2 a_{N/2}$.

Second answer:

(This summarizes some direct enumeration computations that may supplement an analytical answer.)

Let $Y$ be the bitstring outcome of the OP's second procedure, with $N=2m$. Then $Y$ is the result of concatenating $m^2$ indicator functions as follows: $$Y=\text{Concat}_{i=1}^m \text{Concat}_{j=m+1}^{2m}\,1_{x_i>x_j},$$ where $X=(x_1,\ldots,x_{2m})$ is uniformly distributed on the set of $(2m)!$ permutations. (E.g., the OP's $draw_1$ with $m=4$ is $X=[4, 3, 7, 0, 2, 1, 5, 6] \mapsto Y=1100110011110000$.)

Now, the (Shannon) entropy of $Y$ is
$$H(Y)=-\sum_{i=1}^{a_m}p_i\log_2 p_i=-\sum_{i=1}^{a_m}{n_i\over (2m)!}\log_2 {n_i\over (2m)!}=-\sum_{j=1}^{b_m} k_j\,{r_j\over (2m)!}\log_2 {r_j\over (2m)!},\tag{1} $$ where $a_m$ is the number of distinct elements (bitstrings) in the support of $Y$, $p_i=P(Y=i\text{th element in the support of $Y$})={n_i\over (2m)!}$ (taking the elements in some fixed order), and $n_i$ is the number of permutations mapping to the $i$th bitstring (i.e. the "multiplicity" of that bitstring).

The RHS of (1) denotes the result of summing over the $b_m$ distinct multiplicities $r_j$ that occur among the $n_i$, there being $k_j$ cases of $n_{i}=r_j$.

NB: As mentioned in another answer, the sequence $(a_m)$ is listed as OEIS #A048163, and from comments there it can be deduced that $\log_2 a_m\sim \log_2 (2m)!$.

Following is a summary for $1\le m\le 5$, obtained by direct enumeration:

m  a_m     H(X)   H(Y)   Y-distribution (r_1^k_1 r_2^k_2 ... r_b^k_b)
-- -----   -----  -----  ---------------------------------------------
1  2       1      1      1^2
2  14      4.585  3.585  1^8 2^4 4^2
3  230     9.492  7.258  1^72 2^72 4^72 12^12 36^2
4  6902    15.30  11.74  1^1152 2^1728 4^2592 8^432 12^576 16^72 24^192 36^128 96^12 144^16 576^2
5  329462  21.79  16.86  1^28800 2^57600 4^115200 8^43200 12^28800 16^16200 24^21600 36^7200 48^3600 72^2400 96^1800 144^2000 288^800 576^200 1440^40 2880^20 14400^2
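The rows above can be reproduced for small $m$ by brute force (the helper name `table_row` is mine): enumerate all $(2m)!$ permutations, tally the multiplicity $n_i$ of each distinct bitstring, then read off $a_m$, $H(Y)$, and the multiplicity spectrum.

```python
from collections import Counter
from itertools import permutations
from math import factorial, log2

def table_row(m):
    # Map every permutation of {0, ..., 2m-1} to its comparison bitstring Y
    # and tally the multiplicity n_i of each distinct Y.
    counts = Counter()
    for p in permutations(range(2 * m)):
        first, second = p[:m], p[m:]
        counts[tuple(a > b for a in first for b in second)] += 1
    total = factorial(2 * m)
    a_m = len(counts)  # support size of Y
    H_Y = -sum(n / total * log2(n / total) for n in counts.values())
    # Multiplicity spectrum: (r_j, k_j) pairs with r_j ascending.
    spectrum = sorted(Counter(counts.values()).items())
    return a_m, H_Y, spectrum

a_m, H, spec = table_row(2)
print(a_m, round(H, 3), spec)  # -> 14 3.585 [(1, 8), (2, 4), (4, 2)]
```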

Here $r_1<r_2<\ldots<r_{b_m}$ denote the distinct multiplicities of the $Y$-values, and the shorthand notation $r_j\text{^}{k_j}$ indicates $k_j$ occurrences of the same multiplicity $r_j$, giving the following structure for the probability distribution of $Y$ : $$\left(\underbrace{{r_1\over(2m)!},\ldots,{r_1\over(2m)!}}_{k_1 \text{ equal probabilities}},\ \underbrace{{r_2\over(2m)!},\ldots,{r_2\over(2m)!}}_{k_2 \text{ equal probabilities}},\ \underbrace{{r_3\over(2m)!},\ldots,{r_3\over(2m)!}}_{k_3 \text{ equal probabilities}},\ldots \ldots,\ \underbrace{{r_{b_m}\over(2m)!},\ldots,{r_{b_m}\over(2m)!}}_{k_{b_m} \text{ equal probabilities}}\right). $$

(It would be nice to have a formula for $b_m$; unfortunately, there are $46$ sequences in OEIS starting with these same first five values of $(b_m)_{m=1}^\infty=(1,3,5,11,17,\ldots)$, and it's computationally intensive to find more than the first five.)

The notation is such that $\sum_{j=1}^{b_m} k_j = |\text{support}(Y)| = a_m$ and $\sum_{j=1}^{b_m} k_j r_j = |\text{support}(X)| = (2m)!$.

E.g., for $m=2$, the multiplicity $r_1=1$ occurs $k_1=8$ times, multiplicity $r_2=2$ occurs $4$ times, and multiplicity $r_3=4$ occurs $2$ times, so the probability distribution has the structure $({1\over 4!},{1\over 4!},{1\over 4!},{1\over 4!},{1\over 4!},{1\over 4!},{1\over 4!},{1\over 4!},{2\over 4!},{2\over 4!},{2\over 4!},{2\over 4!},{4\over 4!},{4\over 4!})$, whose entropy is then found to be $3.585...$.
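Equivalently, the RHS of (1) can be evaluated directly from the compressed $r_j\text{^}{k_j}$ spectrum without expanding the full distribution; a sketch (the function name `entropy_from_spectrum` is mine):

```python
from math import factorial, log2

def entropy_from_spectrum(m, spectrum):
    # spectrum: list of (r_j, k_j) pairs, as in the table's r^k notation.
    # Implements H(Y) = -sum_j k_j * (r_j/(2m)!) * log2(r_j/(2m)!).
    total = factorial(2 * m)
    return -sum(k * (r / total) * log2(r / total) for r, k in spectrum)

# The m = 2 row of the table: 1^8 2^4 4^2.
print(round(entropy_from_spectrum(2, [(1, 8), (2, 4), (4, 2)]), 3))  # -> 3.585
```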

For this range of $1\le m\le 5$, $H(Y)\approx m\log_2(2m)$.

[Plot: Shannon entropy $H(Y)$ vs. $m$]