Is it wrong to use Binary Vector data in Cosine Similarity?


I am doing Information Retrieval using Cosine Similarity.

My data consists of binary vectors.

Most of the references I have read use non-binary vectors (a non-binary matrix), so I am wondering whether it is wrong to use binary vector data in the cosine similarity function.


There are 2 best solutions below


Binary vector data works fine with cosine similarity. It actually makes the arithmetic simpler: every entry is 0 or 1, so squaring an entry leaves it unchanged, and the magnitude of each vector is simply the square root of the sum of its entries.
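As a minimal sketch of that simplification (the function name `cosine_similarity`, the sample vectors, and the choice to return 0.0 for an all-zero vector are illustrative assumptions, not something from the question), in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Dot product: counts positions where both entries are 1.
    dot = sum(x * y for x, y in zip(a, b))
    # For 0/1 entries x * x == x, so each norm is the
    # square root of the plain sum of the entries.
    norm_a = math.sqrt(sum(a))
    norm_b = math.sqrt(sum(b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)

doc = [1, 0, 1, 1, 0]    # hypothetical document vector
query = [1, 1, 1, 0, 0]  # hypothetical query vector
print(cosine_similarity(doc, query))  # roughly 0.667 (2 shared terms, 3 set bits each)
```

The same function gives the usual result for any 0/1 input; it is only the norm shortcut that relies on the data being binary.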


Consider looking at the Jaccard coefficient and the Tanimoto coefficient, which coincide for binary vectors. They are probably a bit more sensible for binary data.
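For comparison, here is a small sketch of the Jaccard coefficient on 0/1 vectors, treating each vector as the set of positions where it has a 1. The function name `jaccard` and the convention of returning 1.0 for two all-zero vectors are assumptions made for this example:

```python
def jaccard(a, b):
    """Jaccard coefficient of two equal-length 0/1 vectors."""
    # Represent each vector by the set of indices holding a 1.
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, x in enumerate(b) if x}
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    # Size of the intersection divided by size of the union.
    return len(set_a & set_b) / len(union)

print(jaccard([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # 0.5
```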

You can of course use cosine similarity, but computing it with general vector arithmetic is overcomplicated when the data is binary. The dot product boils down to the size of the intersection of the two sets, and each vector's squared length is the number of bits set (so its length is the square root of that count). Realizing that you are just looking at set sizes leads to simpler and faster ways of computing these things.
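One way to act on the set-size view, sketched here under the assumption that each vector is packed into a Python integer used as a bitmask (the helper names `popcount` and `cosine_bits` are made up for this example):

```python
import math

def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")

def cosine_bits(a, b):
    """Cosine similarity of two binary vectors stored as int bitmasks."""
    na, nb = popcount(a), popcount(b)
    if na == 0 or nb == 0:
        return 0.0  # assumed convention for all-zero vectors
    # popcount(a & b) is the intersection size, i.e. the dot product;
    # na and nb are the squared lengths of the two vectors.
    return popcount(a & b) / math.sqrt(na * nb)

doc = 0b10110    # hypothetical document bitmask
query = 0b11100  # hypothetical query bitmask
print(cosine_bits(doc, query))  # roughly 0.667
```

A single bitwise AND plus three bit counts replaces the element-by-element multiply and sum, which is where the speedup would come from on long vectors.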