Compute Number of Bits Used Without Building Huffman Tree


Say I have a list of integers x = [102, 102, 102, 1, 0, -1, -2000, -3, 3, 102, 1, 0, -1, -1, -2000, -2000, -2000, -2000, -1, -1, 0, 0, 1, 3] and the frequency of each number is:

Integer  Count
    102      4
      3      2
      1      3
      0      4
     -1      5
     -3      1
  -2000      5

If I generate a Huffman tree and then encode this sequence, I find that the total number of bits necessary for encoding this list is 65 bits. In theory, the order of the integers shouldn't matter (i.e., I can shuffle this list of integers) as it should still produce the same number of bits (65).

Considering that I only want to compute the number of bits, and I don't care about the actual encoding or the tree, is there a faster/better way to compute the number of bits used (i.e., 65) without explicitly constructing the Huffman tree? Perhaps it is possible to compute the bit count from the frequency counts alone?


I believe that you are looking for the binary entropy -- the average number of bits per symbol needed by an ideal code. To get it in bits, divide the entropy in nats by log(2.0). For your frequencies I get 2.66937 bits per symbol; multiplied by 24 symbols, that is 64.0649 bits. Note that this is a lower bound: a Huffman code assigns a whole number of bits to each symbol, so its total can exceed the entropy. Here the ceiling, 65, happens to equal the exact Huffman count, but that is not guaranteed in general.

#include <cmath>     // for log()
#include <iostream>
#include <map>
#include <vector>
using namespace std;


int main(void)
{
    vector<long signed int> input = {102, 102, 102, 1, 0, -1, -2000, -3, 3, 102, 1, 0, -1, -1, -2000, -2000, -2000, -2000, -1, -1, 0, 0, 1, 3};
    
    size_t length = input.size();

    double entropy = 0.0;

    if (length > 0)
    {
        map<long signed int, size_t> input_map;

        for (size_t i = 0; i < length; i++)
            input_map[input[i]]++;

        for (map<long signed int, size_t>::const_iterator ci = input_map.begin(); ci != input_map.end(); ci++)
        {
            double probability = ci->second / static_cast<double>(length);

            cout << "Number: \'" << ci->first << "\', Count: " << ci->second << ", Probability: " << probability << endl;

            entropy += probability * log(probability);
        }

        entropy = -entropy;
    }

    cout << "Entropy: " << entropy / log(2.0) << " bits/symbol" << endl;
    cout << "Estimated total: " << length * entropy / log(2.0) << " bits" << endl;

    return 0;
}