Why does the distribution of entropy in a deck of shuffled cards not follow a normal distribution?


Around a year ago I did a study on shuffling cards and came across an unexpected result. I was investigating methods for quantifying how "shuffled" a deck of cards was, such as measuring the number of rising sequences in the deck or the bits of entropy of the deck.

First I performed a perfectly random shuffle on a deck of cards using the Fisher–Yates method. This shuffle makes every permutation of the deck equally likely. Imagine you have a sorted deck: one by one, randomly pick cards from it and add them to a pile, gradually forming a new deck. Once this is finished you have a perfectly randomized deck.
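As a minimal sketch, the Fisher–Yates shuffle described above can be written as follows (the function name is mine, not from the original study):

```python
import random

def fisher_yates(deck):
    """Shuffle a list in place so that every permutation is equally likely."""
    for i in range(len(deck) - 1, 0, -1):
        j = random.randint(0, i)  # random index from the not-yet-fixed prefix
        deck[i], deck[j] = deck[j], deck[i]
    return deck
```

Sweeping from the end and swapping each position with a uniformly random earlier (or equal) position is what guarantees the uniform distribution over all $52!$ permutations.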

Then, to measure how "shuffled" this deck was, I used several different measurements, of which I'll highlight two. First, labeling the cards $1$–$52$, a rising sequence is a maximal run of cards with consecutive values ($v, v+1, v+2, \ldots$) that appear in left-to-right order as you progress through the deck. For example, $12345678$ has only one rising sequence, while $12364578$ has two and $13254687$ has four.
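A short sketch of counting rising sequences (the helper name is my own): a new rising sequence starts at every value whose predecessor sits to its right in the deck, plus one for the sequence containing the smallest value.

```python
def rising_sequences(deck):
    """Count rising sequences: maximal runs of consecutive values that
    appear in left-to-right order in the deck."""
    pos = {card: i for i, card in enumerate(deck)}
    # Value v starts a new sequence when v-1 appears to its right.
    return 1 + sum(1 for v in sorted(pos)[1:] if pos[v] < pos[v - 1])
```

On the examples above this gives `rising_sequences([1,2,3,6,4,5,7,8]) == 2` and `rising_sequences([1,3,2,5,4,6,8,7]) == 4`.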

Second, I calculated the entropy of the deck. To do this, first assign the cards numbers $1$–$52$. Let $k[x]$ be the value of the card at position $x$ in the shuffled deck. Define the adjacent differences $d[i] = k[i+1] - k[i]$ for $1 \le i \le 51$, plus a wraparound difference $d[52] = k[1] - k[52]$. If any $d[i]$ is less than $0$, add $52$ to bring it back into the positive range. Now you have a list of $52$ difference values, each between $1$ and $51$. Take these values and create a normalized histogram of them (by dividing the frequency of each difference by $52$). Using these histogram values, calculate the Shannon entropy with the formula$$E = \sum_{j=1}^{51} -p_j\log_b(p_j) $$ where $p_j$ is the normalized histogram value for difference $j$. This process is fairly involved, so I verified my implementation by running many tests against online entropy calculators, where I got identical results. For all my calculations I used log base $2$, so I calculated bits of entropy.
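The entropy measurement described above can be sketched like this (a minimal version, with my own function name; it treats the deck as circular, matching the wraparound difference):

```python
from collections import Counter
from math import log2

def deck_entropy(deck):
    """Shannon entropy (bits) of the circular adjacent-card differences."""
    n = len(deck)
    # Circular differences; the mod maps negatives into the range 1..n-1.
    diffs = [(deck[(i + 1) % n] - deck[i]) % n for i in range(n)]
    counts = Counter(diffs)
    # Normalized histogram -> Shannon entropy (empty bins contribute 0).
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

A sanity check: a sorted deck has every circular difference equal to $1$, so its entropy is $0$ bits; the maximum possible value is $\log_2 51 \approx 5.67$ bits.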

Now, I coded a program to simulate all of this. I shuffled 100 million decks using the Fisher–Yates method, and for every deck I calculated the number of rising sequences and the entropy. I created distributions of these results:

[Histogram of the number of rising sequences] [Histogram of the entropy values]

As you can see, the distribution of rising sequences follows a normal distribution, but the distribution of entropy values looks irregular rather than bell-shaped. Why does this distribution not follow a normal distribution? Is this some property of entropy, or did I make a mistake in my program? I tested my entropy code against many online calculators and got identical results, so I don't think that is the problem. I am also using double values in my code, so they should be accurate to around 15 significant digits, far more precision than could account for the graph above. I also ran the same test using a different shuffling algorithm (swapping random pairs of cards 250 times), and the distribution was identical.

If you'd like to read my full study results you can do that here. I can also share the code if anyone wants to pick it apart for errors.

There is 1 answer below.
  1. Definition: A sequence of random variables $(X_n)$ is said to be asymptotically Normal if there exist real (nonrandom) sequences $(a_n), (b_n)$ such that $\lim_{n\to\infty}\mathbb P\left({X_n-a_n\over b_n} \in A \right)=\mathbb P\left(Z \in A \right)$ for any interval $A\subseteq\mathbb R$, where $Z$ is a standard Normal random variable.

  2. Simulations strongly suggest that the entropies $(E_n)$ are asymptotically Normal:

     [Histograms of the standardized entropy distribution for various $n$]

    Here "standardized entropy" is just ${(E_n -\hat \mu_n)/\hat\sigma_n}$ in a sample of size $N$, with $\hat\mu_n, \hat\sigma_n$ being the sample mean and sample standard deviation (which are found to have effectively converged when $N=10^7$). The green-outlined histogram is just a standardized version of the OP's histogram.

  3. Each of the three histograms shown above is effectively the limit distribution for the corresponding fixed value of $n$ (the number of cards). (Empirically, samples of size greater than $N=10^7$ produce essentially the same histograms.) These histograms show that it's not until $n$ is well above $100$ that the distribution of $E_n$ becomes "approximately" Normal.

  4. Asymptotic Normality would be implied by a Central Limit Theorem that "almost applies" here: a paper by Morris (1975), “Central Limit Theorems for Multinomial Sums,” proves a CLT that would apply here if the adjacent-card differences were i.i.d. Uniform. (Empirically, these differences do appear to be asymptotically Uniform$\{1,...,n-1\}$.) Note that such a special kind of CLT is required because, in addition to the observed frequencies being mutually dependent, the number of population categories ($n-1$) itself depends on $n$. (Classical results are for a fixed number of categories while $n\to\infty$.)

  5. $E_n$ is a simple function of the well-known Likelihood Ratio Statistic ($G_n$) for testing $H_0:\pi_1=...=\pi_k=1/k$ in Multinomial populations when $k$ depends on $n$; e.g., if $k=n-1$, then $$E_n:= -\sum_{j=1}^{n-1} f_j\log_2f_j= \log_2 (n-1) - {G_n\over 2 n \ln 2}$$ where $$G_n:=2\sum_{j=1}^{n-1} N_j\ln{N_j\over n/(n-1)},\quad N_j=nf_j.$$ It's also interesting to note a connection to Pearson's Chi-square statistic ($X_n^2$, often used for testing goodness-of-fit): $$G_n\approx X_n^2 :=\sum_{j=1}^{n-1}{(N_j-n/(n-1))^2\over n/(n-1)}. $$
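The exact identity in item 5 relating $E_n$ to $G_n$ can be verified numerically on random decks; the sketch below does so (the function name is mine, and terms with $N_j=0$ are omitted since they contribute $0$ to both sums):

```python
import random
from collections import Counter
from math import log, log2

def check_entropy_identity(n=52, trials=100):
    """Numerically verify E_n = log2(n-1) - G_n / (2 n ln 2) on random decks."""
    for _ in range(trials):
        deck = list(range(1, n + 1))
        random.shuffle(deck)
        # Circular adjacent differences, mapped into 1..n-1.
        diffs = [(deck[(i + 1) % n] - deck[i]) % n for i in range(n)]
        counts = Counter(diffs)                       # N_j for nonempty bins
        E = -sum((c / n) * log2(c / n) for c in counts.values())
        G = 2 * sum(c * log(c / (n / (n - 1))) for c in counts.values())
        assert abs(E - (log2(n - 1) - G / (2 * n * log(2)))) < 1e-9
    return True
```

The identity follows from $\sum_j N_j = n$: expanding $G_n$ and collecting the $\log$ terms reproduces $-\sum_j f_j\log_2 f_j$ up to the constant $\log_2(n-1)$.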