Help on performing random projection to reduce the dimensionality of large vectors.


See my last post for what led me here

I have two vectors, X and Y, each 100 elements long, filled with values between -1.0 and 1.0.

If I compute the Euclidean distance between them, I get some value. However, computing the Euclidean distance between vectors this large is expensive (sometimes they are 500 elements long), so I would like to reduce their dimensionality.

I learned about random projection and tried applying it. I generated 8 random vectors named V1 through V8 of length 100, all filled with random values between -1.0 and 1.0.

Then I calculated the Euclidean distance between X and each of V1 through V8, saving the eight resulting distances as a vector X1. I did the same for Y against V1 through V8, saving the result as Y1.
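The procedure described above can be sketched in Python with NumPy (array names, sizes, and the random seed are illustrative, chosen to match the description):

```python
import numpy as np

rng = np.random.default_rng(0)

# Original vectors: 100 elements in [-1.0, 1.0]
X = rng.uniform(-1.0, 1.0, 100)
Y = rng.uniform(-1.0, 1.0, 100)

# Eight random vectors V1..V8, also 100 elements in [-1.0, 1.0]
V = rng.uniform(-1.0, 1.0, (8, 100))

# Euclidean distance from X (and from Y) to each random vector,
# giving the 8-element reduced vectors X1 and Y1
X1 = np.linalg.norm(V - X, axis=1)
Y1 = np.linalg.norm(V - Y, axis=1)

pre = np.linalg.norm(X - Y)    # distance in the original 100-d space
post = np.linalg.norm(X1 - Y1) # distance in the reduced 8-d space
print(pre, post)
```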

I then calculated the Euclidean distance between X1 and Y1, and unfortunately the value wasn't that close to the Euclidean distance between X and Y (the original dataset).

Am I performing this operation correctly? Can it be applied in this context? And how does one determine how many random vectors to compute against in order to give reasonable accuracy?


If the Euclidean distance between $X$ and $Y$ is what you're after, then projecting your vectors into a lower-dimensional space will, in general, not reproduce that value. The idea behind dimensionality reduction is that the properties of your dataset you care about (separability, clusterability, etc.) are preserved and, furthermore, are easier to extract in fewer dimensions. Having equal distances before and after the projection is not a requirement for this.

What might be interesting for you to look at is whether the pre- and post-projection distances are correlated. If you're doing your projection correctly, they should be: larger pre-projection distances should produce larger post-projection distances, and vice versa. The typical way to quantify this is with a correlation coefficient; a clearly positive value indicates that the projection is preserving relative distances.
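One way to run that check is to generate many vector pairs, project them, and correlate the pre- and post-projection distances. A minimal Python sketch, assuming a standard random projection (dot products with Gaussian random vectors, scaled by $1/\sqrt{k}$) rather than the distance-based construction from the question; all names and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_pairs = 100, 8, 2000  # original dim, reduced dim, number of test pairs

# Standard random projection: k Gaussian random vectors, scaled so that
# squared distances are preserved in expectation.
R = rng.normal(size=(k, d)) / np.sqrt(k)

pre, post = [], []
for _ in range(n_pairs):
    # Random per-pair scale so the pre-projection distances actually vary.
    scale = rng.uniform(0.1, 2.0)
    x = scale * rng.uniform(-1.0, 1.0, d)
    y = scale * rng.uniform(-1.0, 1.0, d)
    pre.append(np.linalg.norm(x - y))           # distance in the original space
    post.append(np.linalg.norm(R @ x - R @ y))  # distance after projection

# Correlation coefficient between pre- and post-projection distances.
r = np.corrcoef(pre, post)[0, 1]
print(f"correlation: {r:.3f}")
```

Increasing k tightens the correlation toward 1, which is also one practical way to decide how many random vectors to use.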

In addition, actual samples from a real-world process might be more interesting to test with than random data. Distances among samples from a normal distribution being preserved across a transformation isn't very revealing, but on a real dataset you should see the clusters and class boundaries preserved across your projection.