In my application, I have vectors of arbitrary length that normally fall between 0 and 500 elements. Each value in a vector is between -1 and 1. I am computing the similarity of two arbitrary vectors with the following SQL:
1 - SUM(ABS(v1.value - v2.value)) / COUNT(*)
(where v1.value and v2.value refer to values of the same matching item)
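In other words, the similarity is one minus the mean absolute difference of the paired values. A minimal Python/NumPy sketch of the same measure (the function name is mine, not from the SQL):

```python
import numpy as np

def similarity(v1, v2):
    # Equivalent to the SQL: 1 - SUM(ABS(v1.value - v2.value)) / COUNT(*)
    v1, v2 = np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)
    return 1.0 - np.abs(v1 - v2).mean()
```

For values in [-1, 1] this yields 1 for identical vectors and -1 for maximally different ones.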
However, the dimensions are too large for these on-the-fly similarity calculations, so I would like to perform dimension reduction to improve performance.
I found out about random projection and am trying to apply it, but it seems I'm not having much luck, and I was wondering if anyone could spot the flaw in my logic.
First, I expensively computed the similarity between X and Y (two vectors, each with 100 dimensions). This was so that I could know what value I should be shooting for after performing dimension reduction.
Next, I generated 8 vectors of the maximum possible length (100), all filled with random values between -1 and 1. Let's call them V1 to V8.
I then computed the similarity of X against V1, V2, V3, V4, V5, V6, V7, and V8, and saved that vector of respective results as X1. I also computed the similarity of Y against V1, V2, V3, V4, V5, V6, V7, and V8 and saved that vector of respective results as Y1.
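To make the procedure concrete, here is a sketch of the steps above in Python/NumPy (the seed, the uniform random data, and the variable names are illustrative assumptions on my part):

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity(a, b):
    # Same measure as the SQL above: 1 - mean absolute difference
    return 1.0 - np.abs(np.asarray(a) - np.asarray(b)).mean()

d, k = 100, 8                     # original and reduced dimensions
X = rng.uniform(-1, 1, d)
Y = rng.uniform(-1, 1, d)
V = rng.uniform(-1, 1, (k, d))    # the 8 random vectors V1..V8

# "Reduced" vectors: similarity of X (resp. Y) against each of V1..V8
X1 = np.array([similarity(X, v) for v in V])
Y1 = np.array([similarity(Y, v) for v in V])
```

With this construction, similarity(X1, Y1) generally does not track similarity(X, Y), which is the discrepancy described here.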
I then computed the similarity of X1 against Y1 with the above SQL (the same method used to compute the similarity between X and Y), but got a very different result than when I computed the similarity of X against Y directly. This is unfortunate because random projection is often quoted as being surprisingly accurate, so I was hoping the similarity between X1 and Y1 would be very close to the similarity between X and Y.
To clarify, X and Y are the original vectors, and X1 and Y1 are the resultant "reduced dimension" vectors.
So my questions are:
1) Did I do this properly? I'm not sure I fully understand random projection yet, so maybe I messed up a step.
2) How do you determine how many random vectors to compute against? Most of my vectors are going to have between 100 and 300 dimensions. How many dimensions should my reduced vectors have and still be reasonably accurate?
3) Is this possibly not working because I am using the Manhattan Distance instead of the Euclidean Distance? I initially chose the former because the computation is much simpler and I needed it to be fast.
I don't know a lot (indeed, hardly anything) about random projection, but I'm guessing that the projections are supposed to be orthogonal projections. That would require that your $V$ vectors be unit-length and mutually perpendicular (with respect to the inner product you're using).
A deeper problem is that you've got vectors in spaces of various dimensions, and you're putting them all into one high-dimensional space (by padding with zeroes?) and computing inner products; that seems like a very peculiar choice. Before advising further on how to make random projection work, I'd like to know the nature of the data we're working with. For instance, it sounds as if your resulting dataset might have a real "preference" for certain axis-aligned subspaces (those consisting of vectors with lots of zeroes at the ends). It'd be useful to know whether that interacts with the assumptions of the theorems about random projection.
Finally, using the Manhattan metric might have serious consequences as well.
Short summary: read the theorems on random projections and look at their hypotheses closely to see whether they apply to your situation.
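For contrast, here is a sketch of the textbook form of random projection, which reduces dimension via dot products with a scaled Gaussian matrix and aims to approximately preserve *Euclidean* distances (the dimensions and seed are illustrative; this is not your similarity measure):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 100, 32

x = rng.uniform(-1, 1, d)
y = rng.uniform(-1, 1, d)

# Gaussian random projection, scaled by 1/sqrt(k) so that
# Euclidean distances are approximately preserved in expectation.
R = rng.standard_normal((k, d)) / np.sqrt(k)
x1, y1 = R @ x, R @ y

orig_dist = np.linalg.norm(x - y)    # distance in the original space
proj_dist = np.linalg.norm(x1 - y1)  # distance after projection
# With high probability, proj_dist is within a modest relative error
# of orig_dist; larger k shrinks that error.
```

Note that the guarantee is about Euclidean distance between the projected coordinates, not about a "1 - mean absolute difference" similarity, and that 8 target dimensions is far fewer than the theorems typically assume.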