Zero as many correlations as possible while retaining as much correlation information as possible

21 Views Asked by At

I have a correlation matrix $C$ between $n$ variables, and I need to make this correlation matrix as sparse as possible while retaining the maximum overall correlation. I know this is likely too vague a description of what I want to be mathematically meaningful, so let me give a few examples of what I mean and hopefully someone can help me clarify exactly what I am looking for and perhaps give me some directions on how I might achieve it.

Since the matrix is symmetric I only consider the lower triangular part.

Suppose variable $i$ is strongly correlated to variables $k,j,l$, and $k,j$ are strongly correlated to each other, but less strongly to $l$, in that case I would likely remove one of the correlations $c_{ij},c_{ik},c_{kj}$ since the correlation between these variables is still there without one of them. I would also keep the correlation $c_{il}$ but would likely get rid of $c_{lj},c_{lk}$.

Suppose another variable $r$ is not strongly correlated with any other variables, in that case I would likely set all correlations related to $r$ to zero or maybe keep the single largest correlation.

A simple first approach might be to remove all correlations less than a certain threshold and only keep the $m$ strongest of these and then remove any connection between these $m$ variables, but I would like something less ad hoc, and I am guessing that this can be translated to a fairly general type of problem, I just do not know which.

I don't know what the correct framework to solve such a problem in is. I'm guessing I can likely do something with eigen-decomposition, but I am not sure how I would get the entries this way. Another framework that seems obvious is to treat this as a graph problem with all the correlations as edges and then try and reduce the amount of edges as much as possible under some metric I don't know yet. Finally, I guess it could be viewed as some kind of information theory problem, though I do not have a lot of experience with that field, so I can't really say more.

One extra thing to consider that makes me lean more towards the graph approach is that some variables are more important than others, so eventually I will need to take that into consideration, which seems very natural to do in a graph framework where I can assign different values to the nodes.

To give an estimate of the size of problem I am dealing with, $n$ is likely to be in the range of $100-500$ and I would probably like to cut the amount of connections down to $\sim 2n$