Say we have a matrix M where each column is a book, each row is a user, and each cell is a rating. A cell may be empty (the user didn't rate the book) or hold a value from 1 (didn't like the book) to 5 (loved the book).
If I calculate a correlation coefficient (Pearson's r) between two columns (books), then a low value (the minimum is -1) means, roughly speaking, that users who loved book A hated book B, and a high value (the maximum is 1) means that users who loved book A also loved book B.
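As a concrete sketch of the setup (the function and variable names here are mine, not from the post), Pearson's r between two book columns can be computed over just the users who rated both:

```python
import math

def pairwise_pearson(col_a, col_b):
    """Pearson's r between two rating columns, using only the users
    who rated both books (None = not rated). Returns (r, N)."""
    pairs = [(a, b) for a, b in zip(col_a, col_b)
             if a is not None and b is not None]
    n = len(pairs)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den, n

# Hypothetical ratings for two books across five users:
book_a = [5, 4, None, 2, 1]
book_b = [4, 5, 3, None, 1]
r, n = pairwise_pearson(book_a, book_b)  # only 3 users rated both
```

The returned N is exactly the "number of people who read both books" that the rest of the question worries about.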
The problem is that if only a very small number of people (let’s represent this number as N) read a given pair of books, we shouldn’t take that correlation value too seriously. In order to weight high-N correlations higher and low-N correlations lower, I looked into confidence intervals.
The process for getting a confidence interval on a correlation value r, according to OnlineStatBook is:
- Convert r to z' (Fisher's z' transformation)
- Compute a confidence interval in terms of z'
- Convert the confidence interval back to r
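The three steps above can be sketched in Python (presumably a minimal version of what the Colab notebook computes; `fisher_ci` is my name for it):

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for Pearson's r via Fisher's z'.

    Steps: r -> z' (arctanh), interval in z' space using the
    standard error 1/sqrt(n - 3), then back to r (tanh).
    Assumes bivariate normal data; z_crit = 1.96 is the
    two-sided 95% normal critical value.
    """
    z = math.atanh(r)                 # Fisher's z' transformation
    se = 1.0 / math.sqrt(n - 3)       # standard error of z'
    return (math.tanh(z - z_crit * se),
            math.tanh(z + z_crit * se))

lo, hi = fisher_ci(0.999, 4)
print(lo)  # about 0.95, matching the result quoted below
```

Note that the standard error depends on N only through 1/sqrt(N - 3), so with N = 4 the interval in z' space is just z' ± 1.96.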
Using the above instructions and formulae, we get results like the following (if you speak Python, you can see the calculations in this Colab notebook):
For an r of 0.999 and N=4 (very low number of readers who read both books), we get a lower bound 95% confidence interval of 0.95.
It seems odd to me that with only 4 samples we can be 95% confident that the correlation is at least 0.95 (extremely high), even granting that the observed r is as high as 0.999.
For practical purposes, I can make tweaks to "punish" low Ns more, but I'm sure I'm missing something. I'd love to get a deeper understanding, so any links or pointers would be appreciated.
With so few samples you are losing normality: Fisher's z' transformation relies on the data being (approximately) bivariate normal, and with N = 4 there is essentially no basis for that assumption. I edited my previous response because I realized your data is not two categories but rather 5 different ratings. One option is a permutation test: do a linear regression of one variable vs the other, then shuffle the A values through all their permutations, recomputing the coefficient each time. Under the null hypothesis of no correlation (a coefficient of 0), the p-value of your observed coefficient is the fraction of permutations that produce a coefficient at least as extreme.
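A minimal sketch of that permutation test (names are mine; I use the correlation r itself as the test statistic, which orders permutations the same way as the regression coefficient, since permuting one variable's values leaves both variances unchanged):

```python
import math
from itertools import permutations

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

def permutation_pvalue(a, b):
    """Exact permutation test of H0: no correlation.

    Enumerate every ordering of a (feasible only for tiny N) and
    count how often |r| is at least the observed |r|.
    """
    observed = abs(pearson_r(a, b))
    perms = list(permutations(a))
    hits = sum(1 for p in perms
               if abs(pearson_r(p, b)) >= observed - 1e-12)
    return hits / len(perms)

# Even a perfect correlation on N = 4 points: only 2 of the 24
# orderings reach |r| = 1 (the identity and the reversal), so the
# p-value cannot drop below 2/24.
p = permutation_pvalue([1, 2, 3, 4], [1, 2, 3, 4])
```

That floor of 2/24 ≈ 0.083 is one way to see why a 95% claim from 4 points deserves suspicion.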