I asked a question about this yesterday and got a really good response!
Apparently I should use the Euclidean distance (the square root of the sum of squared differences) between the two vectors.
This works well, but I'm running into an issue. Here's an example that shows the problem:
[ [
-1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
1 1
] ]
The answer for this one is 0.684.
Considering that every pair matches except one, it should be much higher. I'm not sure if this is "technically" accurate, but I feel like it should be really close to 1, no? 39 of the 40 data points match exactly, and yet the score is only 0.68.
Is there another algorithm that produces results closer to intuition, or one that lets me more heavily weight the length of the vectors?
Thanks for reading!
Edit: I was just walking around and I thought of what the real answer should probably be close to. Since every point is the same except one, and there are 40 points, I figure it should be around 1 - (1/40) = 0.975. Or maybe 1 - 2(1/40) = 0.95, since the values range from -1 to 1 and not just 0 to 1. Not sure, but that definitely sounds more right!
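For what it's worth, here's a quick sketch in Python (assuming the two columns above are the two vectors being compared) that computes both the raw Euclidean distance and the cosine similarity on this data. Notably, cosine similarity comes out to exactly 38/40 = 0.95, which matches the second guess above:

```python
import math

# The data from the post: 40 pairs, identical except the first,
# where the left value is -1 instead of 1.
a = [-1] + [1] * 39
b = [1] * 40

# Euclidean distance: only the first pair differs, and it differs by 2,
# so the distance is sqrt(2^2) = 2 regardless of how many pairs match.
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(dist)  # 2.0

# Cosine similarity: dot(a, b) / (|a| * |b|) = 38 / 40
dot = sum(x * y for x, y in zip(a, b))
cos = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print(cos)  # 0.95
```

(I can't tell from the post alone which formula produced 0.684 — possibly a scaled or transformed distance — so the sketch just shows the two standard measures for comparison.)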