Suppose there are two discrete random variables $x$ and $y$, each taking $n$ possible values, where $n$ is not too large (at most 10). Then their joint PMF $p \in \mathbb{R}^{n^2}$ is a histogram of $n^2$ bins, all non-negative and summing to 1. I am interested in finding a map $f: \mathbb{R}^{n^2} \rightarrow \mathbb{R}^m$ which satisfies the following:
- It is symmetric under exchange of the random variables $$f(p[x, y]) = f(p[y, x])$$
- It preserves as much information about its argument as possible. That is, given $f(p)$ it should be possible to reconstruct as much of $p$ as possible. I'm not quite sure how to formalize this statement precisely; I welcome suggestions.
The dimension $m$ is a free parameter and can be chosen sufficiently large to satisfy the above constraints. A smaller $m$ is preferable, provided the constraints are satisfied. The goal is to find an algorithm that implements any solution to this problem and to demonstrate that it satisfies the conditions. In case the optimal solution is intractable, an approximate solution may also be of interest.
My attempts:
- $f(p) = \text{const}$. This certainly satisfies the first condition but not the second.
- $m = n^2$, with the two indices running over the values of $x$ and $y$ respectively, and
$$ f_{ij} = \begin{cases} \min(p_{ij}, p_{ji}) & i < j \\ \max(p_{ij}, p_{ji}) & i \geq j \end{cases}$$
This looks like a good candidate solution, but I don't know how to check if it is optimal or if one can do better. I strongly suspect it is not optimal, because, for example, given $p$ I can test if $x$ and $y$ are independent, but given this version of $f$ I can't, and this information should not depend on their order.
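For concreteness, here is a small NumPy sketch of this min/max construction (the function name `f_minmax` is mine, not from any library), together with a numerical check of the exchange symmetry:

```python
import numpy as np

def f_minmax(p):
    """Attempt 2: f_ij = min(p_ij, p_ji) for i < j, max(p_ij, p_ji) for i >= j.
    p is an (n, n) joint PMF; the result is invariant under p -> p.T."""
    lower = np.minimum(p, p.T)   # used where i < j
    upper = np.maximum(p, p.T)   # used where i >= j
    i, j = np.indices(p.shape)
    return np.where(i < j, lower, upper)

rng = np.random.default_rng(0)
p = rng.random((4, 4))
p /= p.sum()  # normalize into a joint PMF
assert np.allclose(f_minmax(p), f_minmax(p.T))  # symmetric under exchange
```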
Edit: Let me make the independence example more clear. If $x$ and $y$ are statistically independent, then $p[x,y] = p[x]p[y]$, where $p[x] = \sum_y p[x,y]$. The test for statistical independence is $\Delta = p[x,y] - p[x]p[y] = 0$. The statement $\Delta = 0$ is invariant under exchange of $x$ and $y$, and thus an optimal $f(p)$ should preserve that information. It is easy to verify that my attempt 2 does not preserve this information, so it is not optimal.
Edit 2: There was a request to know the purpose of this exercise. It is as follows. I have a recording of many variables (e.g. 50) over many trials (e.g. 1000). I want to explore what sort of pairwise and triple-wise statistical relations are likely among these variables. For this purpose, I can compute all pairwise and triple-wise empirical PMFs. Now I would like to perform some classification/clustering on those PMFs. A standard tool to use is a low-dimensional embedding, e.g. tSNE. However, to do the embedding, the dimensions of the PMFs have to have the same meaning for all PMFs. This is where the problem comes in: if I succeed in filtering out the permutation-asymmetric part of the PMFs, the resulting vectors will all have the same meaning and can be clustered. For this reason, there are other properties of $f$ that are desirable for a good result. Namely, it would be good if $f$ did not experience discontinuous jumps due to small perturbations. Generally, it would be good if $f$ neither amplified small differences nor dampened large ones.
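The first step of that pipeline, computing all pairwise empirical PMFs, could look roughly like this (a sketch under the assumption that the data are integer codes in $\{0, \dots, n-1\}$; the function name `pairwise_pmfs` is hypothetical):

```python
import numpy as np
from itertools import combinations

def pairwise_pmfs(data, n):
    """data: (trials, variables) array of integer codes in {0, ..., n-1}.
    Returns a dict mapping each variable pair (a, b) to its (n, n)
    empirical joint PMF."""
    trials, nvars = data.shape
    pmfs = {}
    for a, b in combinations(range(nvars), 2):
        counts = np.zeros((n, n))
        np.add.at(counts, (data[:, a], data[:, b]), 1.0)  # 2D histogram
        pmfs[(a, b)] = counts / trials
    return pmfs

rng = np.random.default_rng(1)
data = rng.integers(0, 3, size=(1000, 5))   # 1000 trials, 5 variables
pmfs = pairwise_pmfs(data, n=3)
assert np.isclose(pmfs[(0, 1)].sum(), 1.0)  # each PMF sums to 1
```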
Here is a standard tool which does something like this, although it satisfies a modified version of your first condition. Namely, you can think of the joint distribution in $\mathbb{R}^{n \times n}$ as an $n \times n$ matrix, compute the $k$ largest singular values $\sigma_1 \ge \dots \ge \sigma_k$ in its singular value decomposition $P = U \Sigma V^T$, and take the truncated SVD
$$P_k = U_k \Sigma_k V_k^T$$
where only the $k$ largest singular values and the corresponding singular vectors are kept and the others are discarded. By the Eckart–Young theorem this is the best approximation of $P$ by a matrix of rank $\le k$, with respect to both the operator / spectral norm and the Frobenius / Hilbert-Schmidt norm. The accuracy of the approximation is controlled by the next largest singular value $\sigma_{k+1}$ (at least in the operator norm), so if this is small then the approximation is good.
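A minimal NumPy sketch of the truncated SVD, including a check that the operator-norm error is exactly $\sigma_{k+1}$:

```python
import numpy as np

def truncated_svd(P, k):
    """Best rank-k approximation of P (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(P)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(2)
P = rng.random((5, 5))
P /= P.sum()  # normalize into a joint PMF
Pk = truncated_svd(P, k=2)

# Operator-norm error of the rank-k truncation equals sigma_{k+1}
s = np.linalg.svd(P, compute_uv=False)
assert np.isclose(np.linalg.norm(P - Pk, 2), s[2])
```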
This is like trying to compute the best approximation of the joint distribution by a mixture of $k$ distributions in which $x$ and $y$ are independent (these correspond to rank-$1$ matrices), except that $P_k$ is not guaranteed to be non-negative or to sum to $1$. It's an interesting question what the actual best approximation by a mixture of $k$ distributions with $x$ and $y$ independent is; I don't know the answer.
Exchanging the random variables corresponds to taking transposes, and the truncated SVD has the property that its transpose
$$P_k^T = V_k \Sigma_k U_k^T$$
is the truncated SVD of $P^T$; in other words it is not invariant with respect to transpose but is instead naturally equivariant. You can force it to be invariant by identifying $P_k$ and $P_k^T$ but this doesn't seem to me to be at all a natural thing to do.
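The equivariance can be verified numerically; note also that the singular values on their own (though they discard the singular vectors) are a genuinely transpose-invariant summary, which may be relevant to the clustering use case:

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((4, 4))
P /= P.sum()

U, s, Vt = np.linalg.svd(P)
k = 2
Pk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Equivariance: the truncated SVD of P^T is Pk^T
# (unique here because the singular values are distinct).
Ut, st, Vtt = np.linalg.svd(P.T)
Pk_of_PT = Ut[:, :k] @ np.diag(st[:k]) @ Vtt[:k, :]
assert np.allclose(Pk_of_PT, Pk.T)

# The singular values themselves are invariant under transpose.
assert np.allclose(np.linalg.svd(P, compute_uv=False),
                   np.linalg.svd(P.T, compute_uv=False))
```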