Permutation-invariant embedding preserving as much information as possible


Suppose there are two discrete random variables $x$ and $y$, each taking $n$ possible values, where $n$ is not too large (at most 10). Then their joint PMF $p \in \mathbb{R}^{n^2}$ is a histogram of $n^2$ bins, which are all non-negative and sum up to 1. I am interested in finding a map $f: \mathbb{R}^{n^2} \rightarrow \mathbb{R}^m$ which satisfies the following:

  1. It is symmetric under exchange of the random variables $$f(p[x, y]) = f(p[y, x])$$
  2. It preserves as much information about its argument as possible. That is, given $f(p)$ it should be possible to reconstruct as much of $p$ as possible. I'm not quite sure how exactly to formalize this statement; I welcome suggestions.

The dimension $m$ is a free parameter and can be chosen sufficiently large to satisfy the above constraints. Smaller values of $m$ are preferable as long as the constraints are satisfied. The goal is to find an algorithm that implements any solution to this problem and to demonstrate that it satisfies the conditions. In case the optimal solution is intractable, an approximate solution may also be of interest.

My attempts:

  1. $f(p) = \mathrm{const}$. This certainly satisfies the first condition but not the second.
  2. $m = n^2$, with the two indices running over the values of $x$ and $y$ respectively, and

$$ f_{ij} = \begin{cases} \min(p_{ij}, p_{ji}) & i < j \\ \max(p_{ij}, p_{ji}) & i \geq j \end{cases}$$

This looks like a good candidate solution, but I don't know how to check whether it is optimal or whether one can do better. I strongly suspect it is not optimal because, for example, given $p$ I can test whether $x$ and $y$ are independent, but given this version of $f$ I cannot, even though this information does not depend on their order.

Edit: Let me make the independence example clearer. If $x$ and $y$ are statistically independent, then $p[x,y] = p[x]\,p[y]$, where $p[x] = \sum_y p[x,y]$. The test for statistical independence is whether $\Delta = p[x,y] - p[x]p[y] = 0$. Whether $\Delta$ vanishes does not depend on the order of $x$ and $y$, and thus an optimal $f(p)$ should preserve that information. It is easy to verify that my attempt 2 does not preserve this information, so it is not optimal.
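To make the counterexample concrete, here is a small NumPy sketch (the function names are mine): it implements attempt 2 and exhibits two joint PMFs, one independent and one not, that the map cannot tell apart.

```python
import numpy as np

def f_minmax(p):
    """Attempt 2: for i < j keep min(p[i,j], p[j,i]), otherwise the max.
    Symmetric under transposition of p by construction."""
    n = p.shape[0]
    q = np.empty_like(p)
    for i in range(n):
        for j in range(n):
            q[i, j] = min(p[i, j], p[j, i]) if i < j else max(p[i, j], p[j, i])
    return q

def delta(p):
    """Independence test Delta = p[x,y] - p[x] p[y]; zero everywhere
    exactly when x and y are independent."""
    return p - np.outer(p.sum(axis=1), p.sum(axis=0))

# an independent joint PMF, and a dependent one obtained by swapping
# a single off-diagonal pair (a change f_minmax cannot see)
u = np.array([0.2, 0.3, 0.5])
v = np.array([0.5, 0.3, 0.2])
p_ind = np.outer(u, v)
p_dep = p_ind.copy()
p_dep[0, 1], p_dep[1, 0] = p_dep[1, 0], p_dep[0, 1]

assert np.allclose(f_minmax(p_ind), f_minmax(p_ind.T))  # condition 1 holds
assert np.allclose(f_minmax(p_ind), f_minmax(p_dep))    # but Delta is lost
assert np.allclose(delta(p_ind), 0)
assert not np.allclose(delta(p_dep), 0)
```

The swap preserves every unordered pair $\{p_{ij}, p_{ji}\}$, hence the value of $f$, while destroying independence.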

Edit 2: There was a request to explain the purpose of this exercise. It is as follows. I have a recording of many variables (e.g. 50) over many trials (e.g. 1000). I want to explore what sort of pairwise and triple-wise statistical relations are likely among these variables. For this purpose, I can compute all pairwise and triple-wise empirical PMFs. Now I would like to perform some classification/clustering on those PMFs. A standard tool is a low-dimensional embedding, e.g. t-SNE. However, for the embedding to make sense, the dimensions of the PMFs have to have the same meaning for all PMFs. This is where the problem comes in: if I can succeed in filtering out the permutation-asymmetric part of the PMFs, the resulting vectors will all have the same meaning and can be clustered. For this reason, there are other properties of $f$ that are desirable for a good result. Namely, it would be good if $f$ did not experience discontinuous jumps under small perturbations; more generally, it should neither amplify small differences nor dampen large ones.


There are 2 best solutions below

---

Here is a standard tool which does something like this, although it satisfies a modified version of your first condition. Namely, you can think of the joint distribution in $\mathbb{R}^{n \times n}$ as an $n \times n$ matrix, compute the $k$ largest singular values $\sigma_1 \ge \dots \ge \sigma_k$ in its singular value decomposition $P = U \Sigma V^T$, and take the truncated SVD

$$P_k = U_k \Sigma_k V_k^T$$

where only the $k$ largest singular values and the corresponding singular vectors are kept and the others are discarded. By the Eckart–Young theorem this is the best approximation of $P$ by a matrix of rank $\le k$, with respect to both the operator / spectral norm and the Frobenius / Hilbert-Schmidt norm. The accuracy of the approximation is controlled by the next largest singular value $\sigma_{k+1}$ (at least in the operator norm), so if this is small then the approximation is good.

This is like trying to compute the best approximation of the joint distribution by a mixture of $k$ distributions where $x$ and $y$ are independent (these correspond to rank-$1$ matrices) except that $P_k$ is not guaranteed to be non-negative or to sum to $1$. It's an interesting question what is the actual best approximation by a mixture of $k$ distributions where $x$ and $y$ are independent; I don't know the answer.

Exchanging the random variables corresponds to taking transposes, and the truncated SVD has the property that its transpose

$$P_k^T = V_k \Sigma_k U_k^T$$

is the truncated SVD of $P^T$; in other words it is not invariant with respect to transpose but is instead naturally equivariant. You can force it to be invariant by identifying $P_k$ and $P_k^T$ but this doesn't seem to me to be at all a natural thing to do.
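As a numerical sketch (assuming NumPy; the function name is mine), the truncated SVD, its transpose-equivariance, and the $\sigma_{k+1}$ error bound can all be checked directly:

```python
import numpy as np

def truncated_svd(P, k):
    """Best rank-<=k approximation of P (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(P)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
P = rng.random((5, 5))
P /= P.sum()                 # a generic joint PMF as a 5x5 matrix

P2 = truncated_svd(P, 2)
# equivariance: truncating the transpose gives the transpose of the truncation
# (the rank-2 approximation is unique since the singular values are distinct)
assert np.allclose(truncated_svd(P.T, 2), P2.T)
# the operator-norm error is exactly the next singular value sigma_3
s = np.linalg.svd(P, compute_uv=False)
assert np.linalg.norm(P - P2, 2) <= s[2] + 1e-12
```

Note that $P_2$ here typically has small negative entries, illustrating the caveat that the truncation is not itself a distribution.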

---

This is not a complete answer to the intended question, but it aims to show points where the question might need additional clarification. I will also borrow the title of a movie: The Good, the Bad and the Ugly.

Contrary to the usual order, I'll start with the (obvious) Bad news: The first condition forces us to drop at least some of the information from the original PMF; since $f(p[x,y])=f(p[y,x])$, knowing only the value of $f$ does not allow us to determine, for example, the probability $P(x>y)$. In general, any question whose answer is not invariant under swapping $x$ and $y$ is going to become unanswerable.
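A tiny numerical illustration of this unavoidable loss (assuming NumPy; the helper name is mine): a joint PMF and its transpose are mapped to the same point by any permutation-invariant $f$, yet they assign different probabilities to the event $x > y$.

```python
import numpy as np

p = np.array([[0.1, 0.4],
              [0.2, 0.3]])  # a 2x2 joint PMF
q = p.T                     # the same pair with x and y exchanged

def prob_x_gt_y(p):
    """P(x > y): sum of entries strictly below the diagonal."""
    return np.tril(p, k=-1).sum()

# any f with f(p) = f(p.T) maps p and q to the same point, yet:
assert not np.isclose(prob_x_gt_y(p), prob_x_gt_y(q))  # 0.2 vs 0.4
```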

Now, the Good news is that aside from this inevitable loss of information, we don't need to drop anything else as long as $m\geq(n^2-1)$. Choose one fixed reading order $o(p[])$ of the $n\times n$ table $p[]$ (e.g. reading it row-by-row, or column-by-column, or any other ordering which lists every item exactly once) which results in a list of $n^2$ numbers.

Lists of numbers of equal length can be compared lexicographically (compare them element by element; at the first differing position, the list with the greater element is declared the greater of the two), and $p[y,x]$ can be obtained from $p[x,y]$ without any additional information, so we can define $$f(p[x,y])=\max\left(o(p[x,y]),o(p[y,x])\right).$$ Since $o(p[x,y])$ determines $p[x,y]$ completely, $f(p[x,y])$ determines the unordered pair $\{p[x,y], p[y,x]\}$ completely too. As such, $f$ allows us to distinguish all the values we are allowed to distinguish and is thus optimal in this sense. As each list sums to $1$, we can actually omit the last element of $o(p[])$, reducing the dimension to $m=n^2-1$.

Having said that, this map $f$ has one bad property: a small change in one element can change the value of $f$ considerably. For example, consider the following table $p[x,y]$ (all numbers are understood to be divided by 10 to avoid messy decimals) and the row-by-row reading order:

$$p[x,y]=\left(\begin{array}{ccc} 0 & 2 & 0 \\ 2 & 0 & 2 \\ 0 & 3 & 1 \end{array}\right)$$

We have $$\begin{array}{lcl} o(p[x,y]) & = & \left(0,2,0,2,0,2,0,3,1\right)\\ o(p[y,x]) & = & \left(0,2,0,2,0,3,0,2,1\right)\\ \end{array}$$

and thus $f(p[x,y])=\left(0,2,0,2,0,3,0,2\right)$ (the last element removed). However, a small change to the number $2$ in the upper row yields a different picture: $$p'[x,y]=\left(\begin{array}{ccc} 0 & 2+\varepsilon & 0 \\ 2 & 0 & 2 \\ 0 & 3 & 1-\varepsilon \end{array}\right)$$

We have $$\begin{array}{lcl} o(p'[x,y]) & = & \left(0,2+\varepsilon,0,2,0,2,0,3,1-\varepsilon\right)\\ o(p'[y,x]) & = & \left(0,2,0,2+\varepsilon,0,3,0,2,1-\varepsilon\right)\\ \end{array}$$ so $f(p'[x,y])=\left(0,2+\varepsilon,0,2,0,2,0,3\right)$, a big difference from $f(p[x,y])$.
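Here is a short NumPy sketch of this construction (the function name is mine), reproducing both the example above and the discontinuity:

```python
import numpy as np

def f_lexmax(p):
    """Flatten p and p.T row-by-row, keep the lexicographically larger
    of the two lists, and drop its last entry (it is recoverable since
    the entries sum to 1)."""
    a, b = p.flatten(), p.T.flatten()
    chosen = a if tuple(a) >= tuple(b) else b
    return chosen[:-1]

p = np.array([[0, 2, 0],
              [2, 0, 2],
              [0, 3, 1]]) / 10

assert np.allclose(f_lexmax(p), f_lexmax(p.T))  # condition 1 holds
assert np.allclose(f_lexmax(p), np.array([0, 2, 0, 2, 0, 3, 0, 2]) / 10)

# an O(eps) perturbation flips the lexicographic comparison ...
eps = 1e-6
p2 = p.copy()
p2[0, 1] += eps
p2[2, 2] -= eps
# ... producing an O(1) jump in the output
assert np.abs(f_lexmax(p2) - f_lexmax(p)).max() > 0.05
```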

If this kind of "discontinuous" behaviour is acceptable for the desired mapping, just wait and see what the Ugly brings: we can actually have $m=1$ and lose no information (other than the inevitable loss) at all! This is not actually all that surprising: since $|\mathbb{R}^k|=|\mathbb{R}|$, there is a bijection $h:\mathbb{R}^{n^2}\to\mathbb{R}$, so we can define $g:\mathbb{R}^{n^2}\to\mathbb{R}$ as $$g(p[x,y])=h(f(p[x,y]))$$ and stuff everything into a single real number. There is no need to worry about computability either; as long as the numbers in $p[x,y]$ are computable, so is the combined value.

Of course, most of this breaks down if we impose some additional restrictions on the map $f$; continuity being the most obvious candidate. What is the actual goal you are trying to achieve by having a mapping with those two properties?