programmer coming over to the light side. You guys are smarter, and your stack exchange is cleaner, so please bear with me.
To summarize what I'm trying to do in general:
I have a very high-dimensional space (900k dimensions), and I'm working on reducing it to a subspace with N dimensions (around 50). The best subspace would be the one that best separates binary-classified data.
Here are a few examples of those subspaces visualized in two dimensions with an algorithm called t-SNE. These are the same data, but represented in different subspaces:
The trick for me is how to programmatically (and thereby mathematically) define how well binary-classified data is separated in a given space. The t-SNE algorithm also has to quantify clustering, and I borrowed some ideas from it; you can see the video that inspired my thought process if you're interested, specifically around the 10-minute mark. Here's the general process I came up with for mathematically defining how well a space separates binary-classified data:
- `fitness` has an initial value of 0
- For point `i` in all points, create a Gaussian which is normalized by the local density of data at `i`'s position
- For all other points `j`, get the value of the Gaussian centered on `i` at the position of `j`, and call that `closeness`
- If `j` has the same classification as `i`, increase `fitness` by `closeness`; if they have different classifications, decrease `fitness` by `closeness`

`fitness` is then used to define how well a particular space separates classified data.
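To make the process above concrete, here is a minimal NumPy sketch of the fitness score. The per-point widths `sigmas` are taken as a given input here, since choosing them from local density is exactly the open question:

```python
import numpy as np

def fitness(points, labels, sigmas):
    """Score how well a space separates binary-classified data.

    points: (n, d) array of coordinates in the candidate subspace
    labels: (n,) array of 0/1 class labels
    sigmas: (n,) per-point Gaussian widths (how to derive these from
            local density is the open question below)
    """
    score = 0.0
    n = len(points)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # value of the Gaussian centered on i, evaluated at j
            d2 = np.sum((points[j] - points[i]) ** 2)
            closeness = np.exp(-d2 / (2.0 * sigmas[i] ** 2))
            # same class pulls fitness up, different class pushes it down
            score += closeness if labels[i] == labels[j] else -closeness
    return score
```

With two tight, well-separated clusters whose labels match the clusters, the score comes out positive; shuffle the labels across clusters and it goes negative.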
The only thing I can't figure out is how to normalize the Gaussian based on the local space about point `i`. In other words: if the data around point `i` is packed closely together, I want the Gaussian to be narrow, and if the data around point `i` is spread out, the Gaussian should be wider.
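One common way to get that behavior (similar in spirit to how t-SNE calibrates a per-point bandwidth via perplexity) is to set each point's Gaussian width to its distance from its k-th nearest neighbor. This is a sketch, not necessarily the right answer, and `k` is an arbitrary tuning choice:

```python
import numpy as np

def local_sigmas(points, k=5):
    """Per-point Gaussian widths from local density: sigma_i is the
    distance from point i to its k-th nearest neighbor, so dense
    regions get narrow Gaussians and sparse regions get wide ones.

    points: (n, d) array; k is a hypothetical tuning parameter.
    """
    # pairwise squared distances via broadcasting, (n, n)
    diff = points[:, None, :] - points[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # sort each row ascending; column 0 is the zero distance to self,
    # so column k is the distance to the k-th nearest neighbor
    d = np.sqrt(np.sort(d2, axis=1))
    return d[:, k]
```

The brute-force pairwise distance matrix is O(n²) memory, so for large point sets a spatial index (e.g. `scipy.spatial.cKDTree`) would be the more practical route.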