Best way to quantify amount of "information" in a matrix?


Given a matrix $X$ of dimensions $n \times d$, my goal is to determine the amount of resources used by this matrix. The definition of "resources used" here is flexible, but it relates to the amount of information encoded in the matrix. That is, a completely uniform matrix uses very few resources, as it can be compressed quite easily.

This relation between resources used and information contained reminded me of entropy. However, my field of specialization is computing, so I'm wondering whether there are better concepts I'm missing. The problem with calculating entropy is that it requires the distribution of the random variable $X$ (here, my matrix entries); I do not have access to such a distribution, nor can I reasonably build one from the single matrix available (in the context of my problem).

I also thought of the SVD, and how the information contained can be related to the variance of the data. However, I moved away from that solution because measuring the amount of resources used would require knowing the true distribution (i.e. the mean and variance when 100% of the information is conserved), which I do not have.

What would be the best way to quantify the amount of resources used, or the amount of information used by/contained in a matrix?

Edit: Some context around my problem. I am developing a machine learning algorithm which seeks to build the optimal architecture for a neural network given some resource constraint. In my setting, the resource constraint is the amount of information encodable or, equivalently, the space available for the model in memory. To do this, however, I need to calculate the amount of resources the model uses. It isn't sufficient to say it takes a certain amount of memory, because that tells me nothing about the actual content of the information. Information in a neural network is encoded in its weight matrices, with one weight matrix per layer. $X$ is one such weight matrix.

Edit 2: The neural network weights can be seen as an approximation to the real distribution of the inputs. We don't have the actual distribution, so this is our best guess. Information here is meant to be the relative difference between the unknown true distribution and our best-guess distribution. I believe one could relate this to the free energy concept in a restricted Boltzmann machine. My goal here is to say: I have encoded $Z$% of the true distribution in my approximation, and this approximation can be stored in $Y$ bits at best, so I know how much resource this information takes.

Accepted answer:

A very commonly used measure of "information" in a matrix is the rank.

A nice way to see that this gives us a good measure of the "information contained" in the matrix is via the SVD. In particular, the rank of a matrix is the number of its non-zero singular values. If these non-zero singular values are $\sigma_1,\sigma_2,\dots,\sigma_r$, then using a "compact SVD", we can recover the original matrix $A$ from these $r$ numbers and $2r$ "singular vectors". In particular, we have $$ A = U \Sigma V^T = \pmatrix{u_1 & \cdots &u_r} \pmatrix{\sigma_1 \\ & \ddots \\ && \sigma_r} \pmatrix{v_1 & \cdots & v_r}^T \\ = \sigma_1 u_1v_1^T + \cdots + \sigma_r u_r v_r^T. $$ As a sanity check, note that the "uniform matrix" (i.e. the matrix whose entries are all equal) has a rank of $1$, which is indeed the lowest possible rank of a non-zero matrix. A weakness of this metric is that the rank can only be a whole number (unlike entropy, for instance).
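As a quick illustrative sketch (not part of the original answer), the rank-via-singular-values idea can be checked numerically with NumPy. The matrix `A` below is a hypothetical example built as a sum of two outer products, so it has rank 2 by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 5 x 4 matrix of rank 2 as a sum of two outer products.
u1, v1 = rng.standard_normal(5), rng.standard_normal(4)
u2, v2 = rng.standard_normal(5), rng.standard_normal(4)
A = np.outer(u1, v1) + np.outer(u2, v2)

# Singular values in descending order; the rank is the number of
# non-zero ones (counted up to a small numerical tolerance).
sigma = np.linalg.svd(A, compute_uv=False)
rank = int(np.sum(sigma > 1e-10))

print(rank)                                    # 2
print(np.linalg.matrix_rank(A))                # NumPy's built-in agrees

# Sanity check from the answer: the all-ones "uniform" matrix has rank 1.
print(np.linalg.matrix_rank(np.ones((5, 4))))  # 1
```

The tolerance `1e-10` is a pragmatic choice: floating-point SVD rarely returns exact zeros, so "non-zero" must be interpreted up to numerical precision.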

The idea of low-rank representing low information-content is the central idea behind applications of low-rank matrix completion to machine learning, which is a highly active area of research. It is also the main idea behind principal component analysis, which was in turn the idea behind one of the first facial recognition algorithms.


Another such measure is the "von Neumann entropy", which (as its name suggests) extends the idea of entropy to matrices. In its usual usage, von Neumann entropy is defined for density matrices (positive semidefinite Hermitian matrices with unit trace), but we can extend this idea to arbitrary matrices as follows:

Suppose that $A$ is a matrix with singular values $\sigma_1,\sigma_2,\dots,\sigma_n$ and that $A$ has been normalized so that $\sigma_1 + \cdots + \sigma_n = 1$. The "entropy" of $A$ can then be defined as the entropy of the distribution $(\sigma_1,\dots,\sigma_n)$. That is, the entropy is equal to the sum $$ H = -[\sigma_1 \log(\sigma_1) + \cdots + \sigma_n \log(\sigma_n)] $$ where $0 \log(0)$ is defined to be zero. This metric of information gives us a definition equivalent to the "entropy of entanglement" of a state in a bipartite quantum system.
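A minimal sketch of this definition (an assumed implementation, not from the original answer): normalize the singular values into a probability distribution and take its Shannon entropy. The helper name `matrix_entropy` is hypothetical:

```python
import numpy as np

def matrix_entropy(A):
    """Entropy of the normalized singular-value distribution of A."""
    sigma = np.linalg.svd(A, compute_uv=False)
    p = sigma / sigma.sum()     # normalize so the values sum to 1
    p = p[p > 1e-12]            # drop (numerically) zero values: 0*log(0) := 0
    return float(-np.sum(p * np.log(p)))

# A rank-1 "uniform" matrix has a single non-zero singular value,
# so its entropy is 0: minimal information content.
print(matrix_entropy(np.ones((4, 4))))

# The identity has n equal singular values, giving the maximum log(n).
print(matrix_entropy(np.eye(4)), np.log(4))
```

Unlike the rank, this quantity varies continuously: a matrix whose singular values decay quickly (nearly low-rank) gets an entropy close to zero even if its exact rank is full.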