Why would square rooting the size of the alphabet decrease the storage size of a trie by 8 and not 16?

47 Views Asked by At

From wikipedia's article on trie's:

There are several ways to represent tries, corresponding to different trade-offs between memory use and speed of the operations. The basic form is that of a linked set of nodes, where each node contains an array of child pointers, one for each symbol in the alphabet (so for the English alphabet, one would store 26 child pointers and for the alphabet of bytes, 256 pointers). This is simple but wasteful in terms of memory: using the alphabet of bytes (size 256) and four-byte pointers, each node requires a kilobyte of storage, and when there is little overlap in the strings' prefixes, the number of required nodes is roughly the combined length of the stored strings.[2]:341 Put another way, the nodes near the bottom of the tree tend to have few children and there are many of them, so the structure wastes space storing null pointers.[12]

> The storage problem can be alleviated by an implementation technique called alphabet reduction, whereby the original strings are reinterpreted as longer strings over a smaller alphabet. E.g., a string of n bytes can alternatively be regarded as a string of 2n four-bit units and stored in a trie with sixteen pointers per node. Lookups need to visit twice as many nodes in the worst case, but the storage requirements go down by a factor of eight.[2]:347–352

I must be making some simple math mistake: The 8 bit alphabet has $2^{8n}$ bottom nodes with $2^{8}*4$ bytes storage each. The 4 bit alphabet also has $2^{4*2n}$ bottom nodes with $2^{4}*4$ bytes storage each. Why does wikipedia say a factor of 8 storage difference and not 16?

1

There are 1 best solutions below

0
On BEST ANSWER

When your quote speaks about "bottom nodes" it doesn't mean leaf nodes -- just nodes that are sufficiently far down the tree that most of them have only a single child.

If you switch from an 8-bit alphabet to a 4-bit alphabet, each node now contains an array of 16 child pointers rather than an array of 256 child pointers. That's a reduction by a factor of 16 -- but on the other hand, you now need twice as many nodes to represent strings of the same length. That halves the savings.