How to construct a maximum-information embedding of sampled objects using a binary function?

22 Views Asked by At

This feels like a very specific problem, but I hope there already is a method to achieve what I want. There is a random process from which I can draw samples of non-numerical, variable sized objects (basically labelled point clouds, but this is probably irrelevant now). I denote these samples with $\boldsymbol{x}$. I also have a bounded, binary function $f(\boldsymbol{x}_i, \boldsymbol{x}_j) \in \mathbb{R} _{[0,1]}$ taking two samples and producing a dissimilarity-measure between them. What I would like to do is to identify a fixed number $N$ of "exemplar" samples $\boldsymbol{x} _1, \boldsymbol{x} _2, \dots \boldsymbol{x} _N$ for which the vector $$ \boldsymbol{v}(\boldsymbol{x}) = [f(\boldsymbol{x} _1, \boldsymbol{x}), f(\boldsymbol{x} _2, \boldsymbol{x}), \dots f(\boldsymbol{x} _N, \boldsymbol{x})] $$ contains the most information about $\boldsymbol{x}$, basically creating an optimal, fixed-length embedding of it.

I was thinking about first calculating some kind of estimation for the mutual information between the vector dimensions (which I don't know how to do) and then trying to pick $N$ exemplars that minimize this common information (which I also don't know how to do; maybe using some kind of clustering algorithm?).

Or, maybe a looser method would be to use the correlation between the dimensions, instead of the mutual information.

I would appreciate any help, link or idea related to this. Thank you!