Finding similarity between elements using statistics


I have a dataset of DJs in which I'm trying to find DJs similar to a specific DJ. Each DJ has a set of genres, each with a certain percentage. How can I find the similarity between two DJs? The following is sample data of DJs with their song count and the percentage of their songs in each genre.

DJ Antoine:
Songs: 105
Pop / Rock 0.95%
Indie Dance / Nu Disco 0.95%
Electronica 40.95%
House 21.9%
Electro House 10.48%
Tech House 0.95%
Progressive House 23.81%


Quentin Mosimann:
Songs: 31
Progressive House 48.39%
House 19.35%
Hard Dance 3.23%
Electro House 29.03%


Project 46:
Songs: 20
Progressive House 80.0%
Electro House 20.0%


Blasterjaxx:
Songs: 62
Progressive House 20.97%
House 12.9%
Electro House 66.13%


D-Block & S-te-Fan:
Songs: 13
Hard Dance 92.31%
Hardcore / Hard Techno 7.69%


Dillon Francis:
Songs: 53
Indie Dance / Nu Disco 15.09%
Electronica 1.89%
House 7.55%
Breaks 1.89%
Electro House 52.83%
Chill Out 1.89%
Dubstep 15.09%
Tech House 1.89%
Progressive House 1.89%


Dannic:
Songs: 37
Progressive House 91.89%
Trance 2.7%
Electro House 2.7%
House 2.7%


Adaro:
Songs: 24
Trance 4.17%
Hard Dance 62.5%
House 4.17%
Hardcore / Hard Techno 29.17%


Richie Hawtin:
Songs: 79
Electronica 6.33%
Chill Out 25.32%
Techno 60.76%
Minimal 5.06%
Tech House 2.53%


Martin Solveig:
Songs: 51
Electronica 7.84%
House 37.25%
Electro House 11.76%
Chill Out 11.76%
Deep House 9.8%
Indie Dance / Nu Disco 13.73%
Progressive House 1.96%
Hip-Hop 5.88%


Felguk:
Songs: 49
Psy-Trance 2.04%
Dubstep 4.08%
Electro House 93.88%


Myon & Shane 54:
Songs: 68
Progressive House 10.29%
Trance 83.82%
Techno 1.47%
Electro House 2.94%
Tech House 1.47%


Cosmic Gate:
Songs: 99
Progressive House 2.02%
Trance 97.98%

0
On

I do not think this is really a statistics question, in the sense of testing a statistical hypothesis about the data. There is no working hypothesis here, and the sample size is too small.

For your problem, a simple approach is to turn each DJ into a vector of genre shares and use the dot product to measure how similar two vectors are (a larger value means more similar). The number of songs may or may not be a factor you want to consider, since a DJ with more songs available could come out as more similar to other DJs. You can change the inner product to get a different measurement. This is a very crude way of measuring similarity, though.
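As a rough illustration of the vector idea, here is a small Python sketch. The genre shares are copied from the question; the helper names `to_vector` and `dot`, and the choice to convert percentages to fractions, are assumptions made for this example.

```python
# Genre shares (in percent) copied from the question for three DJs.
djs = {
    "Project 46": {"Progressive House": 80.0, "Electro House": 20.0},
    "Dannic": {"Progressive House": 91.89, "Trance": 2.7,
               "Electro House": 2.7, "House": 2.7},
    "Cosmic Gate": {"Progressive House": 2.02, "Trance": 97.98},
}

# Fix one genre order so every DJ maps to a vector of the same length.
genres = sorted({g for shares in djs.values() for g in shares})


def to_vector(shares):
    """Return genre fractions in the fixed order, 0.0 for unplayed genres."""
    return [shares.get(g, 0.0) / 100.0 for g in genres]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vectors = {name: to_vector(shares) for name, shares in djs.items()}

# Larger dot product = more overlap in genre shares.
p46_dannic = dot(vectors["Project 46"], vectors["Dannic"])
p46_cosmic = dot(vectors["Project 46"], vectors["Cosmic Gate"])
print(p46_dannic, p46_cosmic)
```

With these inputs, Project 46 scores much higher against Dannic than against Cosmic Gate, because the first pair shares a large Progressive House component.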

0
On

There is no unique answer; I can provide you with an example, though. Let us assume that the distribution of songs across genres is more relevant than the sheer number of songs produced by each DJ when defining a measure of similarity between them. With this idea in mind, we would consider as "near" two DJs with similar genre distributions but rather different numbers of songs, and as "distant" two DJs with comparable numbers of songs that are distributed very differently across genres.

What I am suggesting here is to split:

  • the number of songs
  • distribution of songs across song genres

and then produce a measure of dissimilarity $d(A,B)$ between DJs $A$ and $B$ as follows:

$$d(A,B)=\omega d_s(A,B)+(1-\omega)d_g(A,B). $$

Let us discuss this formula. With $d_s(A,B):=|s(A)-s(B)|$ we denote the distance in the space of song counts, where $s(\cdot)$ is the number of songs produced by a DJ. With

$$d_g(A,B)=\sqrt{\sum_{i=1}^N (g_i(A)-g_i(B))^2} $$

we denote the Euclidean distance between DJs $A$ and $B$ in the space of the $N$ song genres. Label each song genre with an integer $i\in\{1,\ldots, N\}$ and, for each DJ, say $A$, define the associated genre vector $g(A)=(g_1(A),\ldots, g_N(A))$, where $g_i(A)$ denotes the share of $A$'s songs in genre $i$ (the percentage converted to decimal form).

Since $d_s(A,B)$ is an integer and typically dominates the contribution of the genre distance $d_g(A,B)$, we introduce a weight $\omega\in [0,1]$ to reduce the impact of the song count on the distance $d(A,B)$.

The weight itself is not known a priori and it is considered as a parameter in the exercise. Different choices of the weight give different classifications of DJs: a quick check of the results should indicate which choice of weight is more suitable for further steps.
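The weighted dissimilarity $d(A,B)$ can be sketched in Python as follows. This is only an illustration: it assumes $d_g$ is the Euclidean distance with the square root taken, and the function names and dictionary layout are hypothetical. The numbers are taken from the question.

```python
import math


def genre_distance(ga, gb):
    """Euclidean distance between two genre-share dicts (fractions in [0, 1])."""
    genres = set(ga) | set(gb)
    return math.sqrt(sum((ga.get(g, 0.0) - gb.get(g, 0.0)) ** 2 for g in genres))


def dissimilarity(a, b, omega):
    """d(A, B) = omega * d_s(A, B) + (1 - omega) * d_g(A, B)."""
    d_s = abs(a["songs"] - b["songs"])               # difference in song counts
    d_g = genre_distance(a["genres"], b["genres"])   # difference in genre shares
    return omega * d_s + (1.0 - omega) * d_g


project_46 = {"songs": 20,
              "genres": {"Progressive House": 0.80, "Electro House": 0.20}}
blasterjaxx = {"songs": 62,
               "genres": {"Progressive House": 0.2097, "House": 0.129,
                          "Electro House": 0.6613}}

# Try a few weights to see how strongly the song count drives the result.
for omega in (0.0, 0.01, 0.5, 1.0):
    print(omega, dissimilarity(project_46, blasterjaxx, omega))
```

Because the song-count difference here is 42 while the genre distance is below 1, even a small $\omega$ lets the song count dominate, which is why the weight needs to be tuned by inspecting the results.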

Another approach would be to standardize all contributions (song numbers, songs per genre) per each DJ and apply the Euclidean or Manhattan distance to the result. The standardization would avoid the weighting.
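One possible reading of the standardization idea, sketched in Python: treat the song count and each genre share as a feature, convert every feature to a z-score across the DJs, and then compare the standardized rows. The column layout, the use of the population standard deviation, and the helper names are assumptions made for this example; the data come from the question.

```python
import math
import statistics

# Columns: songs, Progressive House, Electro House, House, Trance
names = ["Project 46", "Dannic", "Cosmic Gate"]
rows = [
    [20, 0.80, 0.20, 0.0, 0.0],
    [37, 0.9189, 0.027, 0.027, 0.027],
    [99, 0.0202, 0.0, 0.0, 0.9798],
]


def standardize(rows):
    """Z-score each column across DJs; constant columns become all zeros."""
    out_cols = []
    for col in zip(*rows):
        mu = statistics.mean(col)
        sigma = statistics.pstdev(col)
        out_cols.append([(x - mu) / sigma if sigma > 0 else 0.0 for x in col])
    return [list(r) for r in zip(*out_cols)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


z = standardize(rows)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], names[j], euclidean(z[i], z[j]), manhattan(z[i], z[j]))
```

Note that z-scoring gives every feature the same spread, so a genre that only one DJ plays a little of can weigh as much as a dominant genre; whether that is desirable depends on the application.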

0
On

The cosine similarity measure should work fine. Represent each DJ as a vector whose coordinates are the percentages of time the DJ plays each genre: a coordinate is larger than zero if the DJ plays music from that genre, and zero otherwise. Then the similarity between ${DJ}_i$ and ${DJ}_j$ is given by

$\operatorname{sim}({DJ}_i, {DJ}_j) = \frac{{DJ}_i \cdot {DJ}_j}{|{DJ}_i||{DJ}_j|}$,

where

${DJ}_i = (PG_{i,1}, PG_{i,2}, \ldots, PG_{i,k})$,

${PG}_{i,m}$ is the percentage of time genre $m$ is played by ${DJ}_i$, and $k$ is the number of genres in the data set.
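The cosine similarity formula can be tried out with a short Python sketch. The function name `cosine_similarity` and the dictionary representation (missing genres count as zero) are assumptions made for this example; the percentages come from the question.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two genre-percentage dicts."""
    genres = set(a) | set(b)
    dot = sum(a.get(g, 0.0) * b.get(g, 0.0) for g in genres)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b)


project_46 = {"Progressive House": 80.0, "Electro House": 20.0}
dannic = {"Progressive House": 91.89, "Trance": 2.7,
          "Electro House": 2.7, "House": 2.7}
d_block = {"Hard Dance": 92.31, "Hardcore / Hard Techno": 7.69}

print(cosine_similarity(project_46, dannic))   # close to 1: very similar mix
print(cosine_similarity(project_46, d_block))  # 0.0: no genre in common
```

Since the measure depends only on the direction of the vectors, it makes no difference whether the shares are given as percentages or as fractions, and the number of songs does not enter at all.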