Gaussian normalized by local density


I'm a programmer coming over to the light side. You guys are smarter, and your Stack Exchange is cleaner, so please bear with me.

To summarize what I'm trying to do in general:

I have a very high-dimensional space (900k dimensions), and I'm working on reducing it to a subspace with N dimensions (N ≈ 50). The best subspace would be the one that best separates binary-classified data.

Here are a few examples of those subspaces visualized in two dimensions with an algorithm called t-SNE. These are the same data, represented in different subspaces:

Arbitrary Subspace 1

Arbitrary Subspace 2

The trick for me is how to programmatically (and thereby mathematically) define how well binary-classified data is separated in a given space. The t-SNE algorithm also has to quantify clustering, and I borrowed some ideas from it; you can see the video that inspired my thought process if you're interested, specifically around the 10-minute mark. Here's the general process I came up with for mathematically defining how well a space separates binary-classified data:

  • fitness has an initial value of 0.
  • For each point i, create a Gaussian centered on i whose width is normalized by the local density of the data around i.
  • For every other point j, evaluate the Gaussian centered on i at j's position, and call that value closeness.
  • If j has the same classification as i, increase fitness by closeness; if their classifications differ, decrease fitness by closeness.
  • The final fitness value then defines how well a particular space separates the classified data.
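
For concreteness, here's a minimal sketch of the steps above in NumPy. It assumes the data is an (n_points, n_dims) array with 0/1 labels, and it uses a single fixed bandwidth `sigma` as a placeholder, since choosing the width per point is exactly the part I haven't worked out yet:

```python
import numpy as np

def fitness(points, labels, sigma=1.0):
    """Sum Gaussian 'closeness' over same-class pairs and subtract it
    over different-class pairs, as described in the bullet list above."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # squared distances from point i to every point j
        d2 = np.sum((points - points[i]) ** 2, axis=1)
        # value at j of a Gaussian centered on i
        closeness = np.exp(-d2 / (2 * sigma ** 2))
        closeness[i] = 0.0  # skip j == i
        same = labels == labels[i]
        total += closeness[same].sum() - closeness[~same].sum()
    return total
```

With this, a subspace where the two classes form tight, well-separated clusters scores high, and a subspace where they interleave scores low or negative.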

The only thing I can't figure out is how to normalize the Gaussian based on the local space around point i. In other words: if the data around point i is packed closely together, I want the Gaussian to be narrow, and if the data around point i is spread out, the Gaussian should be wider.
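
One candidate I've been considering (just an assumption, not settled): derive a per-point bandwidth sigma_i from the distance to point i's k nearest neighbors, similar in spirit to how t-SNE picks a per-point bandwidth from a perplexity target. Dense neighborhoods then give narrow Gaussians and sparse ones give wide Gaussians:

```python
import numpy as np

def local_sigmas(points, k=5):
    """sigma_i = mean distance from point i to its k nearest neighbors.
    Small in dense regions (narrow Gaussian), large in sparse regions
    (wide Gaussian)."""
    n = len(points)
    sigmas = np.empty(n)
    for i in range(n):
        d = np.sqrt(np.sum((points - points[i]) ** 2, axis=1))
        d[i] = np.inf                       # exclude the point itself
        sigmas[i] = np.sort(d)[:k].mean()   # mean of the k smallest distances
    return sigmas
```

I'm not sure whether the mean k-NN distance is the right density proxy here, or whether something like t-SNE's perplexity-matched bandwidth would behave better.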