I am currently working on a Huffman data compression algorithm and have hit a problem while trying to optimize space efficiency by changing the set of symbols the algorithm uses.
I will give more context, but if you only care about the mathematical problem itself, skip this next section.
Context
I found that, given a file and a complete set of symbols, there seems to be no way to compute the exact compressed file size that Huffman coding would produce without actually running the file through the Huffman compression algorithm, short of a rough estimate that is closer to a guess than anything else. This would not be a problem if running the program on a complete set of symbols did not take hours for a 100 MB file.
The problem comes down to this: a time-efficient algorithm cannot be guaranteed to be very space-efficient, while the most space-efficient algorithm is certain to be very time-inefficient relative to the time-efficient one. One way around this trade-off is to have the algorithm pick a complete set of symbols that produces the most space-efficient result on average for a file of the given size. Since this probability calculation is relatively short, this yields a relatively time-efficient algorithm that produces a more optimized Huffman compression on average.
I have found a way to describe this calculation step in a mathematical way, but I am struggling with finding its solution.
Here is the problem:
Let S be a set of y unique symbols.
Let b be a random sequence of symbols from S that is x symbols long such that x $\ge$ y. (symbols can repeat in b)
Finally, let T be the set of unique symbols used in b.
Given x and y, what is the probability that the number of elements in T is less than some z $\in \mathbb{Z}$, where y $\ge$ z $\ge$ 2?
Simple Example
Suppose S has 3 unique symbols and b is 3 symbols long. For argument's sake, let's say S = {A, B, C}. If we ask for the probability that b uses exactly 1 unique symbol, then b must be AAA, BBB, or CCC.
Of course, there are 27 different 3-symbol words we can make from S. Therefore, the probability that b consists of only 1 symbol from S is $\frac{3}{27} = \frac{1}{9}$. In this case, z = 2.
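This small case can be checked by exhaustive enumeration; a minimal Python sketch (function and variable names are mine, chosen for illustration):

```python
from itertools import product

# Brute-force check of the simple example: S = {A, B, C}, b of length 3.
S = "ABC"
words = list(product(S, repeat=3))            # all 27 possible sequences b
hits = [w for w in words if len(set(w)) < 2]  # sequences with |T| < z, z = 2

print(len(words))              # 27
print(len(hits))               # 3: AAA, BBB, CCC
print(len(hits) / len(words))  # 1/9 = 0.111...
```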
Final Remarks
I have tried more complicated examples, but the combinatorics grows very complicated very quickly. Is there a general solution to this problem given x, y, and z?
According to this answer to a related question, the number of sequences $b$ with exactly $k = |T|$ unique symbols is:
$$\binom{y}{k}\sum_{j = 0}^{k} (-1)^j\binom{k}{j}(k - j)^x$$
We multiply by $\binom{y}{k}$ because that is the number of ways to choose an alphabet of $k$ symbols from $S$.
And the number of sequences with $k = |T| < z$ is:
$$\sum_{k=1}^{z-1}\binom{y}{k}\sum_{j = 0}^{k} (-1)^j\binom{k}{j}(k - j)^x$$
therefore the probability you are looking for is:
$$\frac{\sum_{k=1}^{z-1}\binom{y}{k}\sum_{j = 0}^{k} (-1)^j\binom{k}{j}(k - j)^x}{y^x}$$
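The formula is straightforward to evaluate and to validate against brute-force enumeration for small cases. A sketch in Python (the function names are mine; exact rational arithmetic via `Fraction` avoids floating-point error):

```python
from math import comb
from itertools import product
from fractions import Fraction

def prob_fewer_than_z(x, y, z):
    """P(|T| < z) for a uniform random length-x sequence over y symbols."""
    total = sum(
        comb(y, k) * sum((-1)**j * comb(k, j) * (k - j)**x for j in range(k + 1))
        for k in range(1, z)
    )
    return Fraction(total, y**x)

def brute_force(x, y, z):
    """Count sequences with fewer than z distinct symbols by enumeration."""
    hits = sum(1 for w in product(range(y), repeat=x) if len(set(w)) < z)
    return Fraction(hits, y**x)

print(prob_fewer_than_z(3, 3, 2))       # 1/9, matching the simple example
print(prob_fewer_than_z(3, 3, 4))       # 1, since z = y + 1

# The formula agrees with brute force on small instances.
assert all(
    prob_fewer_than_z(x, y, z) == brute_force(x, y, z)
    for y in range(2, 4) for x in range(y, 6) for z in range(2, y + 2)
)
```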
I ran numerical tests on a few examples, and it yields $1$, as expected, when $z = y + 1$, but please verify yourself too or wait for someone else's confirmation.
I will further investigate if the above formula can be simplified.
Edits
The last part of this answer confirms the result.
Searching for the formula with Approach Zero, I didn't find any question where such an expression has been simplified, so I suppose it isn't possible.
Note that:
$$\sum_{j = 0}^{k} (-1)^j\binom{k}{j}(k - j)^x=k!{x \brace k}$$
where ${x \brace k}$ are Stirling numbers of the second kind, and $k!{x \brace k}$ is at OEIS A019538.
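This identity can also be verified numerically. A sketch computing Stirling numbers of the second kind via the standard recurrence $S(n,k) = k\,S(n-1,k) + S(n-1,k-1)$ and comparing against the inclusion-exclusion sum (helper names are mine):

```python
from math import comb, factorial
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling numbers of the second kind via the standard recurrence."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def surjections(x, k):
    """Inclusion-exclusion count: sequences of length x using all k symbols."""
    return sum((-1)**j * comb(k, j) * (k - j)**x for j in range(k + 1))

# Verify sum_j (-1)^j C(k,j) (k-j)^x = k! * S(x, k) for small x, k.
for x in range(1, 8):
    for k in range(1, x + 1):
        assert surjections(x, k) == factorial(k) * stirling2(x, k)
print("identity verified")
```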