Computing Minimum Description Length

46 Views Asked by At

I have two integer arrays, x and y:

x = np.array([[1, 2, 0, 12,  4],
              [5, 2, 1, 10, 12]]
            )

y = np.array([[1, 2, 0, 11,  4],
              [5, 3, 0, 10, 15]]
            )

And I would like to use x to compress/compute the description length (in units of "bits") for y and then compare the number of "bits saved" as a result of the compression. Given that our data is amll, we'll simply use n_bits = 8 (8 bits) to store each integer. In the uncompressed case, it takes a total 2 x 5 x 8 = 80 bits to store y (i.e., DL(y) = 80). Similarly, DL(x) = 80. Now, let's assume that x is the best possible "model"/"hypothesis" for compressing y, then according to the MDL framework:

DL(y, x) = DL(y|x) + DL(x)

Where DL(x) are the number of bits needed to store x and DL(y|x) is the residual bits of y given x:

residual = x - y

array([[ 0,  0,  0, -1,  0],
       [ 0,  1, -1,  0,  3]])

So, what is DL(y|x) for this residual array? According to some examples that I've come across (that I don't fully understand), DL(y|x) can be computed by first identifying the number of unique values in the residual

n_bits = 8
n_unique = len(np.unique(residual))  # 4
DL_residual = 2 * 5 * np.log2(n_unique) + n_unique * n_bits  # 52 bits

If I understand correctly, since n_unique = 4 (i.e., the cardinality of the residual is 4), then it looks like 2 * 5 * np.log2(n_unique) is accounting for the number of bits to store the residual. However, I have no idea why n_unique * n_bits is needed (maybe it is not??). Naively, I would've assumed that 2 * 5 * np.log2(n_unique) was sufficient.

I don't even know if this even the right way to compute the description length for the residual and, ultimately, I need to figure out what the description length of the residual is.