Strange distribution of movie ratings


I like math but I also like movies. I have been collecting movies all my life, and my collection is rather large: almost 25,000 movies. Being a developer as well, I was able to create my own online catalogue and pull various statistics from the database. There is one thing that puzzles me.

Movies have ratings and I did not invent mine: I have copied them from IMDb. As you probably already know, IMDb ratings go from 1 to 10, with 1 being the lowest. I have created a histogram representing ratings distribution and it looks like this:

IMDb movie ratings distribution

I expected to see something like a normal distribution, but my histogram has a funny dip around rating 7.0.
Is this a known phenomenon in statistics?
Has anyone seen something like this in other data?


There are 4 best solutions below

8
On BEST ANSWER

You can get the full IMDb dataset (updated daily) from here!

On it (as of 27/06/2023) are 293,501 rated films. The distribution of their rating is shown below:

enter image description here

As you can see, the full dataset doesn't show the same bimodal distribution as the curated sample in the question.

This suggests that the sampling is producing this bimodality. There are lots of possible reasons for this but perhaps the datasets will let you explore a bit more.


Many of the films have 500 votes or fewer. If we discount those, we're left with around 58k films whose ratings distribution is below:

enter image description here
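The vote-count filter described above can be sketched as follows. This is only an illustration: the `(average_rating, num_votes)` pairs, the function name `rating_histogram`, and the sample rows are assumptions made up for the example, not real IMDb data or the code used to produce the charts.

```python
from collections import Counter

def rating_histogram(films, min_votes=0):
    """Count films per rating, keeping only films with more than min_votes votes.

    `films` is an iterable of (average_rating, num_votes) pairs (hypothetical layout).
    """
    return Counter(rating for rating, votes in films if votes > min_votes)

# Made-up rows for illustration only, not real IMDb data.
sample = [(7.1, 12000), (6.4, 300), (8.0, 950), (7.1, 45), (5.9, 501)]

all_films = rating_histogram(sample)
popular = rating_histogram(sample, min_votes=500)  # drop films with 500 votes or fewer

for rating in sorted(popular):
    print(rating, popular[rating])
```

Comparing `all_films` with `popular` shows how much of the shape is driven by films with few votes.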

One striking fact about these charts is quite how high the ratings are. It seems a rating of 5 does not correspond to an "average" film. Perhaps you get a few ratings points for making a film at all ;-).

4
On

This is a case of a bimodal distribution, that is, one with two peaks.

In general, such bimodal distributions are "mixtures" of two unimodal distributions, which may be hidden.

Here, I would guess (because I do not have more details to go on) that IMDb users are of two general types: those who think an average movie should have a rating of about 6, and those who think an average movie should have a rating of about 7. Put the users together and we get two modes, that is, two peaks.

If we were able to group the users into those two types and then plot each group separately, we would get two unimodal distributions.
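A minimal sketch of the mixture idea, assuming two equally weighted normal components with the same spread. The means 6 and 7 come from the guess above; the spreads 0.3 and 0.8 and the function names are hypothetical. It evaluates the mixture density on a grid and counts the local maxima.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, mu1, mu2, sigma, weight=0.5):
    """Density of a two-component normal mixture with a shared sigma."""
    return weight * normal_pdf(x, mu1, sigma) + (1 - weight) * normal_pdf(x, mu2, sigma)

def count_peaks(mu1, mu2, sigma, lo=0.0, hi=10.0, steps=2000):
    """Count strict local maxima of the mixture density on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [mixture_pdf(x, mu1, mu2, sigma) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

print(count_peaks(6.0, 7.0, 0.3))  # narrow components: two separate peaks
print(count_peaks(6.0, 7.0, 0.8))  # wide components: the peaks merge into one
```

For equal weights and equal spread, two peaks appear only when the means are more than two standard deviations apart, so a mixture is not automatically bimodal.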

Intuitive Examples:

(1) When we plot the distribution of weight, height, 100-metre sprint time, lifting strength, etc. across the general population or across Olympic athletes, we may get a bimodal distribution.

When we plot the same quantities for the male population only, or for male Olympic athletes only, we may get a unimodal distribution.
Likewise, we will get a unimodal distribution among females.

Merging the two groups gives the bimodal distribution.

(2) There are cases with more than two peaks: multimodal distributions, which are "mixtures" of more than two unimodal distributions.
The distribution of "time of maximum customers" in canteens may have 3 (or 4) peaks: at breakfast, lunch, (evening tea time,) and dinner.

5
On

That may just be noise.

Your data do not really follow a normal distribution (the plateaued peak from about $6.4$ to $7.4$ is wide compared to how fast the distribution falls away on either side of it, particularly on the left) but, even if they did, you could easily see something similar.

Here is a simulated example using R with $25000$ samples from a normal distribution, and it also has a dip around $7.0$. Using a different seed would give a different pattern of peaks and troughs in the middle of the distribution but a similar amount of noise.

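# Draw 25000 values from N(7, 0.8), round to one decimal place like IMDb ratings,
# and plot the count at each rounded value.
set.seed(2023)
plot(table(round(rnorm(25000, 7, 0.8), 1)), xlim=c(4.5, 9.0))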

enter image description here
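As a rough check on how large that noise should be, here is a sketch in Python that reuses the same assumed parameters as the R example (25000 draws from a normal distribution with mean 7 and standard deviation 0.8). It treats the count in a single bin as binomial, which is an assumption of this sketch.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

n = 25000
mu, sigma = 7.0, 0.8
lo, hi = 6.95, 7.05  # values in this interval round to 7.0

p = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)  # chance of landing in the bin
expected = n * p                                           # expected bin count
sd = math.sqrt(n * p * (1.0 - p))                          # binomial sd of the bin count

print("expected count:", round(expected))
print("standard deviation:", round(sd, 1))
print("relative noise:", round(sd / expected, 3))
```

Under these assumptions a bin near the centre fluctuates by roughly 3% of its height from sample to sample, so neighbouring bars near the peak can easily swap order.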

0
On

This is known as a "bimodal distribution". The modes in this case are close together, so it's not a very strong effect. You could model the distribution as a mixture (weighted sum) of two normal distributions with slightly different means; i.e., there are two "types" of movies that you like, one that averages slightly higher ratings than the other.

The idea that distributions tend to be normal comes from the fact that a lot of numbers come from a bunch of different effects, each one only a small percentage of the total effect, and uncorrelated or not very correlated with the others. Bimodality suggests that there are some factors that have a very large effect, and/or are strongly correlated with each other. It could be that there are two clusters of movies, one slightly better than the other. Or there's one cluster of reviewers that tend to give slightly below 7, and another slightly above, and they tend to review different movies. But given that, according to Chris Lewis' charts, movies in general are not bimodal, it seems that there are two clusters of movies that you've collected. For example, maybe half of your movie collection was chosen by you, and the other half was chosen by your partner.
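The "many small, uncorrelated effects" idea can be illustrated with a small simulation. This is a sketch under assumed choices: twelve uniform contributions per value, 20000 values, and a fixed seed are all picked for convenience and have nothing to do with the movie data.

```python
import random
import statistics

def sum_of_small_effects(n_effects, rng):
    """Add up n_effects independent contributions, each uniform on (-0.5, 0.5)."""
    return sum(rng.random() - 0.5 for _ in range(n_effects))

rng = random.Random(2023)  # fixed seed so the run is repeatable
# Each uniform term has variance 1/12, so twelve of them sum to variance 1.
values = [sum_of_small_effects(12, rng) for _ in range(20000)]

print("mean:", round(statistics.fmean(values), 3))
print("variance:", round(statistics.pvariance(values), 3))

# Share of values within one standard deviation; about 0.68 for a normal distribution.
within_one_sd = sum(abs(v) <= 1.0 for v in values) / len(values)
print("within 1 sd:", round(within_one_sd, 3))
```

No single uniform term looks bell-shaped, yet their sum is already close to normal; one dominant effect, or strong correlation between effects, would break that.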

Bimodality is one characteristic that can distinguish a distribution from a normal one. Others are what are called "higher moments". The first moment describes where the center is, and the second describes how spread out the distribution is. These two moments vary from one normal distribution to another, and knowing these two moments and that a distribution is normal tells you what its value is everywhere. For a normal distribution, all moments past the second are determined by the first two, so if the actual moments don't match what they would be for a normal distribution, that's another way the distribution deviates from normality.

The third moment basically measures how symmetrical the distribution is, and corresponds to "skew". A normal distribution is perfectly symmetrical, and thus has zero skewness. Your distribution has negative skew, which means that it fades away more slowly on the left than it does on the right.
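For reference, one common way to compute skewness from data is the third central moment divided by the standard deviation cubed. The sketch below uses that population-style formula; the function name and the two tiny sample lists are made up for illustration.

```python
def sample_skewness(values):
    """Third standardized moment: m3 / m2**1.5, using population (divide-by-n) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [5.0, 6.0, 7.0, 8.0, 9.0]    # mirror image around 7
left_tailed = [2.0, 6.0, 7.0, 7.0, 8.0]  # one low outlier stretches the left side

print(sample_skewness(symmetric))    # zero for perfectly symmetric data
print(sample_skewness(left_tailed))  # negative: the long tail is on the left
```

Running the same function over the ratings column of the catalogue would give a single number to compare against the zero skewness of a normal distribution.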