Estimating the probability distribution of favorite movies


Imagine that I repeat a random experiment in which I pick someone at random from the population and ask for his/her favorite movie, F.

There exist about 300,000 movies, but F is obviously not uniformly distributed: some movies have many more fans than others.

Now, imagine that I repeat the random experiment N=1000 times and gather the answers $f_i$ ($1 \le i \le N$). How could I estimate the number of movies that together account for R=80% of the probability mass of F?

My ideas for solving the problem:

An important step would be to identify the family of distributions that F belongs to, and then to estimate the parameters of that distribution.

Intuitively, I would say that a geometric distribution is a good candidate. Admittedly, a geometric distribution is defined over the integers while, in my case, the distribution is over movies. But if I rank the movies from the most popular to the least popular, I can define a geometric distribution over the rank. You could also point out that a geometric distribution has unbounded support while, in my case, the number of movies is finite. That number is large, though, so a geometric distribution could still be a good approximation.

I have found how to compute a maximum likelihood estimator of the parameter p of the geometric distribution (see "unbiased estimator for geometric distribution").
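For reference, here is a sketch of where that estimator comes from. It assumes that the ranks $X_1, \dots, X_N$ of the answers are i.i.d. geometric on $\{1, 2, \dots\}$, i.e. $P(X = k) = (1-p)^{k-1}p$; the independence assumption is mine and is only approximate when the ranks are themselves derived from the sample.

$$
\ell(p) = N \ln p + \ln(1-p) \sum_{i=1}^{N} (X_i - 1),
\qquad
\ell'(p) = \frac{N}{p} - \frac{\sum_{i=1}^{N} (X_i - 1)}{1-p} = 0
\;\Longrightarrow\;
\hat{p} = \frac{N}{N + \sum_{i=1}^{N} (X_i - 1)} = \frac{N}{\sum_{i=1}^{N} X_i}.
$$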

So my idea would be to:

  • order the movies in the sample from the one that appears most often among the $f_i$ to the one that appears least often.

  • estimate the parameter p with $\hat{p}=\dfrac{N}{N+\sum_{i=1}^N(X_i-1)}$ (where $X_i$ is the rank of $f_i$).

  • use the geometric distribution CDF, $1-(1-p)^k$, to get the smallest k for which it reaches 0.8, i.e. $k \ge \dfrac{\ln(1-0.8)}{\ln(1-p)}$.
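The three steps above can be sketched in Python as follows. This is only an illustration of the idea under the geometric assumption, not a validated method; the function name `estimate_k` and its `coverage` parameter are hypothetical names chosen for this example.

```python
import math
from collections import Counter


def estimate_k(answers, coverage=0.8):
    """Return (p_hat, k) under the ASSUMPTION that movie ranks are geometric.

    answers:  list of favorite-movie labels, one per person asked (the f_i).
    coverage: target share of probability mass (R in the question).
    """
    counts = Counter(answers)
    # Step 1: rank movies by decreasing sample frequency (rank 1 = most cited).
    rank = {movie: r
            for r, (movie, _) in enumerate(counts.most_common(), start=1)}
    ranks = [rank[a] for a in answers]  # X_i = rank of f_i

    # Step 2: MLE of p for a geometric law on {1, 2, ...}:
    # p_hat = N / (N + sum(X_i - 1)) = N / sum(X_i).
    p_hat = len(ranks) / sum(ranks)
    if p_hat >= 1.0:
        # Every answer was the same movie; one movie covers everything.
        return p_hat, 1

    # Step 3: smallest k with 1 - (1 - p_hat)**k >= coverage.
    k = math.ceil(math.log(1 - coverage) / math.log(1 - p_hat))
    return p_hat, k


if __name__ == "__main__":
    sample = ["A"] * 4 + ["B"] * 2 + ["C"]
    print(estimate_k(sample))
```

Note that the ranks here are computed from the same sample that is used to estimate p, and movies that never appear in the sample receive no rank at all, so the resulting k should be read as a rough estimate.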

What do you think of my idea?