The number of combinations of Fi to F255 that satisfies the following inequality.

119 Views Asked by At

Consider the following equation:

$H(x) = - \sum_{i=0}^{255} P_{i} \ \log_{2} \ (P_{i}) $

When $ P_{0} = P_{1} = ... =P_{255} = \frac {1}{256}$

$H(x) = - \sum_{i=0}^{255} \frac {1}{256} \ \log_{2} \ (\frac {1}{256}) = 8 $

Where $p_i$ is the probability of the character $x_i$ to occur in the file.

The file Size is 256 characters.

We will refer to the frequency of the character $x_i$ as $f_i$. ($p_i$ = $f_i / 256)$.

The sum of the frequencies of the characters will always be $256$.

My question is:

Given that $H(x)$ is $\ge 7$, How to calculate the number of possible combinations of characters frequencies (from $f_{0}$ to $f_{255}$) that makes the inequality true?

1

There are 1 best solutions below

0
On BEST ANSWER

You want

\begin{eqnarray*} H&=&-\sum_ip_i\log_2p_i\\ &=& -\sum_i\frac{f_i}{256}\log_2\frac{f_i}{256}\\ &=&-\frac1{256}\sum_i\left(f_i\log_2f_i-f_i\log_2256\right)\\ &=&8-\frac1{256}\sum_if_i\log_2f_i\\ &\ge&7 \end{eqnarray*}

and thus

$$ \sum_if_i\log_2f_i\le256\;, $$

or (with $0^0=1$)

$$ \prod_if_i^{f_i}\le2^{256}\;, $$

which is a good form of the condition for computer purposes, as it allows us to do everything with integer computations without rounding errors.

This Java code performs the computation.

The number of admissible frequency assignments with the characters treated as indistinguishable is $83497348$.

The number of admissible frequency assignments with the characters treated as distinguishable is

24377117336293373068077194760018608890164075844989457788709896790123975565346885123594568979784410896832589991747589003175962663714550060904409876659.

The number of admissible files is

32294749681296695275407031178128991719155449234352428397819516042408274669320014204971984810980113839713404991542286248511856534533725621770720443204257942757841685726869445418911988343694467610597716668644890003072566046313323943032907344235616323318993946853455308766090267464412982787428679760943574604833815718162734374857821826072728560557821111484218984087078219375690986647992988689743586158787866836166182263828747140499381872352135857955196142673766290736723267626340276801993482047192554123014017270771149942354934825395952764480929741003537616825060105246264524800000000000000000000000000000000000000000000.

There are $256^{256}$ files in total, so approximately $99.93\%$ of the files fulfil the condition.

The results should be reasonably reliable, as the total number of files is correctly reproduced when the entropy condition is omitted, and this depends on all three counts being performed correctly.

Here's a $\log_2$-plot of the proportion of files with $n$ characters chosen from $n$ symbols for which

$$ \prod_if_i^{f_i}\gt2^n\;, $$

that is, the proportion of inadmissible files. The line is a linear fit to the last $40$ data points, with slope approximately $-0.0317$. I don't know how this constant could be calculated. Note the interesting odd/even structure. (Click on the image to see the full high-resolution version.)

log_2 plot of the proportion of inadmissible files