Decoding Every Top 100 Voting Ever

763 Views Asked by At

I need expert help on the math behind the following voting mechanism, any comment towards solutions are greatly appreciated!

--

A country is holding a poll to determine the top 100 restaurants out of its 100 thousand domestic restaurants.

Assuming each restaurant is unique, no chains or franchises, all of them share equal and fair chance of exposure to all customers nationwide.

In the end there were 1 million valid voters, with each voter named 5 restaurants. Say each voter entered 5 non-repeat and valid entries.

Is there a way to estimate: 1. the minimum votes a restaurant needs to be in top 100; 2. how many votes does #1 get?

--

I'm not a math expert so my educated guess is each restaurant has a probability of 0.1% chance to be voted but that's as far as I could go, I have absolutely 0 idea how to proceed next…

Look forward to your help, thanks!

--

[UPDATE 07/17/2019]

Thank you guys for all the insightful inputs! I’m confident that we’re on the right track to finding out the best answer and we’re close.

If we are to put our solutions to test in a real world scenario however, I think it’s highly unlikely that a mere 2 digit votes will render a lucky restaurant Top 100. I assume we can all agree on this?

So now the question is: how can we apply common sense to the equation? How do we identify the distribution model in real world?

The Gaussian distribution seems to be an adequate one to describe the world we live in, where heavenly restaurants and unholy ones are extremely rare, and passable/half-decent ones constitute the majority.

Let's add more conditions to define the scenario:

  1. Each restaurant has a Twitter account
  2. The voting result is positively correlated with Twitter follower counts
  3. The #1 restaurant has 10,000,000 followers whereas #100 has 100K

Does this make more sense?

5

There are 5 best solutions below

8
On

It is incredibly unlikely but possible to have a massive tie: each restaurant gets $10$ votes. So, you could be the joint winner with just $10$ votes. If you want to be the unique winner then you need just one voter to switch. You will then have a clear winner with $11$ votes, lots will have $10$, and one poor unique loser will have just $9$.

You cannot be a winner, joint or not, with less than $10$ as if you have less than $10$ then at least one other will have more than $10$.

The other extreme, and again incredibly unlikely, is that one restaurant gets all of the votes and the others none.

Very, very unlikely but possible. If you want to exclude those cases then it gets more tricky. First, you would have to define what you accepted.

Update: I missed that each voter had $5$ votes. This makes a difference to the numbers but not my general point. Thanks to Henning for spotting this. The text above applies to the $1$ vote each variant.

The massive tie could still occur but each restaurant will have $50$ votes.

Similarly, some of the other numbers need to be adjusted.

2
On

(Discussion)

I think Billboard Music Award voting is a perfect example on such mechanism:

https://www.billboardmusicawards.com/2019-voting-rules/

The more popular artists (or restaurants in this case) get more votes thus fan counts may serve as an indication.

If we assume #1 has 10 millions followers on Twitter and #100 has 100K. Does that help in narrowing the distribution method?

2
On

If you want a simple Gaussian modelling of the process, you could do something like this

$$ PD = \frac{1}{\sqrt{2*\pi*\sigma^2}}\mathbb{e}^{ {\displaystyle \frac{-x^2}{2*\sigma^2}}} $$

where $\sigma$ is taken such that 1) There is not too much cluttering around 0, i.e. that only one restaurant gets all the votes. 2) The restaurants beyond x=1000,000, get very little votes, i.e. $${\displaystyle \int_{1000,000}^{\infty} PD\,dx \le \frac{1}{total\_votes\_expected}}$$. Thus it is good to have some idea of the total number of votes you are expecting, hence the relevance of some experimentation.

Now the integral $\int_{x}^{\infty}\mathbb{e}^{x*2}$ can be computed numerically at least, within certain error bounds, and maybe analytically too. (Check out, numerical integration of ${\displaystyle e^\frac{x^2}{k}}$). By doing that one can get the $\sigma$ required,as the integral would come out in terms of sigma. Put that value of $\sigma$ into $PD$. Then your required value will be $$ Minimum \ Votes \ required=PD(99)\times total\_votes\_expected $$ $$ Topmost \ Votes=PD(0)\times total\_votes\_expected $$

or if you want to be more accurate.

$$Minimum \ Votes\ required={\displaystyle \int_{99}^{100} PD\,dx \times{total\_votes\_expected}}$$

$$ Topmost \ Votes={\displaystyle \int_{0}^{1} PD\,dx \times{total\_votes\_expected}}$$

1
On

We can do something under the (very simplistic and almost certainly completely counterfactual) assumption that each of the $10^6$ voters pick 5 restaurants to vote for uniformly at random.

Suppose each voter does that by picking a random restaurant five times with replacement, and simply picks again if he picks one he has voted for already. Even for the fifth pick, the probability that the pick will need to be redone is only $\frac{4}{10^5} = \frac{1}{25\,000}$, which means that the re-picked votes will constitute an extremely small fraction of the total votes. Thus, simplifying approximation I: ignore the re-picks and instead just assume that there are $5\times 10^6$ independently uniform votes.

Thus, the number of votes each restaurant gets follows a binomial distribution with parameters $n = 5\times 10^{6}$ and $p=10^{-5}$.

In principle the number of votes two different restaurants get are not independent -- it is possible (according to our now simplified assumptions) to get all of the votes, with some very small probability, but not for two restaurants to both get all of the votes. However, realistically it is fantastically unlikely for any single restaurant to get so many votes that it meaningfully disturbs the probabilities of the other top-100 restaurants, so let's ignore that to. That is simplifying approximation II: Look not for the actual joint probabilities, but just for the number of votes such that a restaurant's probability of getting at least that many is $\frac{100}{100\,000}$.

The standard simplifying approximation III is now to approximate the binomial distribution with a normal (Gaussian) distribution. I don't remember the formulas to use but Wolfram Alpha tells me we get a mean of $50$ (well, I could have told that) and a standard deviation of $7.07$ votes.

How many standard deviations out do we have to be in order to be in the top 0.1%? We can ask Wolfram Alpha about that too and get $$ k \approx 71.8 $$

So if you carry out the experiment and it takes much more than 72 votes to get into the top 100, then the initial assumption about random voter behavior probably needs to be discarded.

0
On

We can simulate this (in Python):

from collections import Counter
import random

REST_COUNT = 100000
VOTERS = 1000000
VOTES = 5

rests = Counter()
for i in range(VOTERS):
    for j in range(VOTES):
        rest = random.randrange(REST_COUNT)
        rests[rest] = rests[rest] + 1

top = rests.most_common(100)
print (top[0], top[99])

Note: we've skipped the requirement that the places be unique (it's unlikely to be very relevant), and we're assuming all restaurants to be equally likely to be selected... which is probably untrue!

From five runs, I get top scores of 82, 84, 86, 87, 88, and joint hundreth place was always 73.