I was studying Cosine Similarity and I have just seen this article. https://medium.com/@rahulkuntala9/cosine-similarity-and-handling-categorical-variables-29f907951b5
The author uses cosine similarity to measure how similar p1 is to each of the other vectors.
p1 = (1,0,0,150), newp1 = (1,0,0,100), newp2 = (1,0,0,200), newp3 = (0,0,1,135) and newp4 = (0,1,0,250)
Similarity(p1,newp1) = 0.999994
Similarity(p1,newp2) = 0.999998
Similarity(p1,newp3) = 0.99995
Similarity(p1,newp4) = 0.99994
My question is: since I want to use the cosine similarity as a weight for some values, how can I use these results to do that? All the similarities are almost 1, with no meaningful differences between them, so I see no point in using these results. I have considered using Euclidean distance to measure similarity instead, but I know it is not always the best choice for that.
What do you propose? Thank you!
Cosine similarity essentially measures the angle between two vectors.
If you think geometrically you can see why all your values are close to $1$. Consider the two vectors $(1,0,100)$ and $(0,1,150)$ in three dimensions. Each points nearly straight up from the $x$-$y$ plane, so the angle between them is very small.
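You can verify this numerically. The sketch below (using NumPy, with the vectors from the question) computes the cosine similarities directly; the large fourth coordinate dominates the dot product and both norms, so every result comes out close to $1$:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

p1 = (1, 0, 0, 150)
others = {
    "newp1": (1, 0, 0, 100),
    "newp2": (1, 0, 0, 200),
    "newp3": (0, 0, 1, 135),
    "newp4": (0, 1, 0, 250),
}

for name, v in others.items():
    # Every similarity is > 0.999: the numeric fourth coordinate
    # swamps the 0/1 categorical coordinates.
    print(name, round(cosine_similarity(p1, v), 6))
```

Note that changing the categorical coordinates (newp3, newp4) barely moves the result, which is exactly the problem you observed.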
To separate the vectors in your application you have to find another way to take into account the differences in the first three categorical variables. There is no off-the-shelf formula for that.
If you handle the categorical variables separately, then Euclidean distance may well be reasonable for the remaining coordinates. It will capture the large differences in the fourth coordinate.
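One common way to combine the two parts is a Gower-style mixed distance: a mismatch term for the categorical (one-hot) coordinates plus a range-scaled absolute difference for the numeric one. The sketch below is only an illustration of that idea, not a prescription; the weight `w_cat` and the numeric range are assumptions you would have to choose from your data:

```python
import numpy as np

def mixed_distance(a, b, num_range, w_cat=0.5):
    """Hypothetical mixed distance for vectors whose first three
    coordinates are a one-hot category and whose fourth is numeric.

    Combines a 0/1 categorical mismatch term with a numeric
    difference scaled by the observed range (Gower-style).
    w_cat is an assumed weight balancing the two parts.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cat_diff = 0.0 if np.array_equal(a[:3], b[:3]) else 1.0
    num_diff = abs(a[3] - b[3]) / num_range  # scaled into [0, 1]
    return w_cat * cat_diff + (1.0 - w_cat) * num_diff

# Using the vectors from the question; 150 is the assumed observed
# range of the numeric coordinate (250 - 100).
d_same_cat = mixed_distance((1, 0, 0, 150), (1, 0, 0, 100), num_range=150)
d_diff_cat = mixed_distance((1, 0, 0, 150), (0, 0, 1, 135), num_range=150)
```

Unlike cosine similarity, this distance now separates newp3 from newp1: a category mismatch contributes a fixed penalty that the large numeric coordinate cannot wash out.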
Your answer will have to depend on what the variables actually mean in your context.