I have a dataset of DJs and I'm trying to find DJs similar to a specific DJ. Each DJ has a set of genres, each with a certain percentage. How can I measure the similarity between two DJs? The following is sample data: each DJ's song count and the percentage of their songs in each genre.
DJ Antoine:
Songs: 105
Pop / Rock 0.95%
Indie Dance / Nu Disco 0.95%
Electronica 40.95%
House 21.9%
Electro House 10.48%
Tech House 0.95%
Progressive House 23.81%
Quentin Mosimann:
Songs: 31
Progressive House 48.39%
House 19.35%
Hard Dance 3.23%
Electro House 29.03%
Project 46:
Songs: 20
Progressive House 80.0%
Electro House 20.0%
Blasterjaxx:
Songs: 62
Progressive House 20.97%
House 12.9%
Electro House 66.13%
D-Block & S-te-Fan:
Songs: 13
Hard Dance 92.31%
Hardcore / Hard Techno 7.69%
Dillon Francis:
Songs: 53
Indie Dance / Nu Disco 15.09%
Electronica 1.89%
House 7.55%
Breaks 1.89%
Electro House 52.83%
Chill Out 1.89%
Dubstep 15.09%
Tech House 1.89%
Progressive House 1.89%
Dannic:
Songs: 37
Progressive House 91.89%
Trance 2.7%
Electro House 2.7%
House 2.7%
Adaro:
Songs: 24
Trance 4.17%
Hard Dance 62.5%
House 4.17%
Hardcore / Hard Techno 29.17%
Richie Hawtin:
Songs: 79
Electronica 6.33%
Chill Out 25.32%
Techno 60.76%
Minimal 5.06%
Tech House 2.53%
Martin Solveig:
Songs: 51
Electronica 7.84%
House 37.25%
Electro House 11.76%
Chill Out 11.76%
Deep House 9.8%
Indie Dance / Nu Disco 13.73%
Progressive House 1.96%
Hip-Hop 5.88%
Felguk:
Songs: 49
Psy-Trance 2.04%
Dubstep 4.08%
Electro House 93.88%
Myon & Shane 54:
Songs: 68
Progressive House 10.29%
Trance 83.82%
Techno 1.47%
Electro House 2.94%
Tech House 1.47%
Cosmic Gate:
Songs: 99
Progressive House 2.02%
Trance 97.98%
I do not think this is really a statistics question, in the sense of testing some statistical hypothesis about the data. There is no working hypothesis here, and the sample size is just too small.
For your problem, a simple approach is to turn each DJ into a vector of genre percentages and use the dot product to measure how close two vectors are. The song count may or may not be a factor to consider, since it is possible that a DJ with more songs available will come out as more similar to other DJs. You can change the inner product to get a different measure. This is a very crude way of measuring similarity, though.
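As a rough sketch of the idea, the snippet below stores a few of the DJs from the question as genre-to-percentage mappings (a missing genre counts as 0) and compares them with a dot product. The `cosine` function is one possible "different measurement": it divides the dot product by the vector lengths, which is my own choice of normalization and not something stated above. The names `djs`, `dot`, `cosine` and `most_similar` are made up for illustration, and song counts are ignored here.

```python
import math

# Genre percentages copied from the sample data in the question (a subset of DJs).
djs = {
    "Project 46": {"Progressive House": 80.0, "Electro House": 20.0},
    "Dannic": {"Progressive House": 91.89, "Trance": 2.7,
               "Electro House": 2.7, "House": 2.7},
    "Cosmic Gate": {"Progressive House": 2.02, "Trance": 97.98},
    "D-Block & S-te-Fan": {"Hard Dance": 92.31, "Hardcore / Hard Techno": 7.69},
}

def dot(a, b):
    """Dot product of two sparse genre vectors; genres absent from a DJ count as 0."""
    return sum(a[g] * b[g] for g in a.keys() & b.keys())

def cosine(a, b):
    """Dot product scaled by both vector lengths (assumed normalization)."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def most_similar(name, data):
    """Return (other DJ, score) pairs for `name`, highest cosine score first."""
    target = data[name]
    scores = [(other, cosine(target, vec))
              for other, vec in data.items() if other != name]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for other, score in most_similar("Project 46", djs):
        print(f"{other}: {score:.3f}")
```

With this subset, Dannic ranks closest to Project 46 because both are dominated by Progressive House, while D-Block & S-te-Fan scores 0 because the two share no genre. To let the song count matter, one option would be to multiply each percentage by the number of songs before taking the dot product.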