Beginner probability question: Bimodal distribution (i.e. like some Yelp reviews)


Background

Let's say a Yelp reviewer usually gives either 1 star or 5 stars, because when her experience is average she doesn't feel as motivated to write a review. Sometimes she will give 2 or 4 stars, and extremely rarely she will give 3 stars.

Let's say $X =$ number of stars she will give

Question 1

You work for Yelp and your boss asks you what kind of RV is $X$ and what is the associated probability mass function? What is the formula for $P(X=k)$ ?

For example, if $X =$ average male height in feet, then $X$ is like a normal RV and we can use the pdf to find the $P(X>12)$

My attempt

  • Maybe you say to the boss, $X$ is a Beta RV with parameters like $(\frac{1}{2},\frac{1}{2})$ although Beta is continuous

  • In general I can't find an explicit pmf for the bimodal. When I go on the wikipedia for the Multimodal Distribution, it is the first distribution I've seen that doesn't have a pmf/pdf, cdf, mean, etc. It seems like the Bimodal should have its own thing like the Poisson or Gamma...


Thanks for your help and putting up with my ignorance.

There are 2 best solutions below

Best answer

On $X$ and Modelling

Something to understand about random variables: they're functions, which are neither random nor variables. (No, this is not a fact from "basic" probability, but in a senior undergrad course in probability, or a graduate-level course, this is how random variables are approached.) We call $\Omega$ the sample space, and say that $X$ is a function defined on $\Omega$ that maps to a space $E$, denoted $X: \Omega \to E$. Thus $X$ is shorthand for $X(\omega)$ with $\omega \in \Omega$, and it is $\omega$ that is actually random; for a fixed $\omega$, $X(\omega)$ is completely determined. The notation $P(X = k)$ is shorthand for $P(\{\omega: X(\omega) = k\})$.
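To make the "random variables are functions" point concrete, here is a minimal R sketch. The sample space, its probabilities, and the names `omega`, `prob_omega`, and `prob_X_equals` are all invented for illustration; only the idea that $P(X = k)$ is the probability of the set $\{\omega: X(\omega) = k\}$ comes from the text above.

```r
# Hypothetical sample space: each omega is one possible dining experience.
# The probabilities are made up for illustration only.
omega <- c("great food", "great service", "fine", "slow service", "food poisoning")
prob_omega <- c(0.25, 0.20, 0.05, 0.15, 0.35)

# X is an ordinary deterministic function from the sample space to star counts.
X <- function(w) {
  stars <- c("great food" = 5, "great service" = 5, "fine" = 3,
             "slow service" = 2, "food poisoning" = 1)
  unname(stars[w])
}

# P(X = k) is the total probability of the set {omega : X(omega) = k}.
prob_X_equals <- function(k) {
  sum(prob_omega[X(omega) == k])
}

pmf <- sapply(1:5, prob_X_equals)
names(pmf) <- 1:5
print(pmf)
```

Two different outcomes map to five stars here, so $P(X = 5)$ adds their probabilities, while no outcome maps to four stars, so $P(X = 4) = 0$.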

When you ask what kind of random variable $X$ is, that's purely a modelling problem. We may say that $X$ counts the number of stars for a given rating, $\omega$. Then we may say $X$ maps from the space of user ratings, $\Omega$, to the natural numbers, $\mathbb{N}$, or $X: \Omega \to \mathbb{N}$. But we may also say that the operator "$+$" should not be defined, that different ratings cannot be added together; they make sense only in the sense of order. So you can say that $X(\omega_1) \leq X(\omega_2)$ but $X(\omega_1) + X(\omega_2)$ does not make sense. This is fine; in some sense, a rating system with 1-5 stars is just as informative as one whose levels are one, two, three, four, and nine stars (replacing the five-star rating with a nine-star rating is harmless, since it doesn't change rankings).

In short, we may say that $X$ is an ordinal random variable. If that's the case, we can define a probability mass function (pmf) and a cumulative distribution function (cdf), but $E[X]$ is not defined. (But the median is.)
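A small R sketch of why the median survives an order-preserving relabelling while the arithmetic mean does not. The pmf values are hypothetical, and the relabelling of the top level from 5 to 9 mirrors the nine-star example above.

```r
# Hypothetical pmf over the five ordered star levels (made-up numbers).
stars <- 1:5
pmf <- c(0.30, 0.10, 0.05, 0.15, 0.40)

# The cdf only needs the ordering of the levels.
cdf <- cumsum(pmf)

# Median: the lowest level whose cumulative probability reaches 1/2.
median_index <- which(cdf >= 0.5)[1]

# Relabel the top level from 5 to 9: an order-preserving change.
relabelled <- c(1, 2, 3, 4, 9)

median_original <- stars[median_index]
median_relabelled <- relabelled[median_index]

# The arithmetic mean depends on the numeric labels, not only on the order.
mean_original <- sum(stars * pmf)
mean_relabelled <- sum(relabelled * pmf)
```

The median picks out the same level (the fourth) under both labellings, whereas the label-weighted average moves from 3.25 to 4.85. That is one way to see why $E[X]$ is not meaningful for a purely ordinal variable.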

As for what pmf $X$ should have, there is no canonical answer. Nor should there be; if $X$ always had the same pmf, rating systems would be useless since the distributions of all destinations' user ratings would be identical, which is like saying users have the same distribution of opinions for every destination on Yelp. Thus, no location is better or worse than another. This is clearly wrong. All you're going to say is that the pmf can be non-zero only for $k \in \{1, \ldots, 5\}$, but otherwise $P(X=k)$ can be anything, and statistical procedures should be used to figure out what the probability is.
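One simple statistical procedure of the kind meant here is to estimate $P(X = k)$ by relative frequencies. The sketch below uses a made-up vector of twenty ratings for a single location; the data and the name `pmf_hat` are assumptions for illustration.

```r
# Hypothetical observed ratings for one location (made-up data).
ratings <- c(1, 5, 5, 1, 4, 5, 1, 2, 5, 1, 5, 3, 1, 5, 5, 4, 1, 5, 2, 5)

# Count each star level, keeping levels that were never observed.
counts <- table(factor(ratings, levels = 1:5))

# Relative frequencies estimate P(X = k) for k = 1, ..., 5.
pmf_hat <- as.numeric(counts) / length(ratings)
names(pmf_hat) <- 1:5
print(pmf_hat)
```

A different location would give a different `pmf_hat`, which is exactly what makes the ratings informative.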

We can describe the pmf of other discrete random variables, like the binomial or geometric random variables, because we explicitly describe a process or model that produces a natural assignment of probabilities. But if you told me that $X$ is a discrete random variable and you don't tell me the data-generating process, I cannot tell whether $X$ is a geometric or a Poisson random variable.

The same is true for continuous random variables. People assign the Normal distribution to heights or IQs because they can; nothing says that these distributions are correct (in fact, they are certainly incorrect since heights and IQs cannot be negative, events which occur with a non-zero probability according to the Normal distribution). These distributional assumptions need to be tested.
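The point about negative values can be checked numerically. The sketch below assumes a Normal model for height in feet with mean 5.75 and standard deviation 0.25; both parameters are invented for illustration, not estimates from data.

```r
# Hypothetical Normal model for adult height in feet (parameters are assumed).
mu <- 5.75
sigma <- 0.25

# Probability the model assigns to a negative height.
p_negative <- pnorm(0, mean = mu, sd = sigma)

# Probability the model assigns to a height above 12 feet,
# the event P(X > 12) from the question.
p_above_12 <- pnorm(12, mean = mu, sd = sigma, lower.tail = FALSE)

print(c(negative = p_negative, above_12 = p_above_12))
```

Both probabilities are astronomically small but strictly positive, so the model is wrong in a literal sense while still possibly being a useful approximation.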

I cannot think of a process that produces Yelp ratings, so I'm not going to assume what the ratings' distribution is.


A Bimodal Random Variable

From the comments you seem to want a bimodal distribution of some sorts. Below I will invent one that might make some sense.

A Yelp user either likes or dislikes a location, in some sense, and they like the location with probability $\alpha$. Then they need to decide how many stars to give the location. If they like the location, then they flip $n$ coins, with each coin having probability $p_1$ of appearing star-side up (the other side is blank). The number of stars seen is the rating the user assigns to the location. If they dislike the location they do the same thing but with a coin with probability $p_2$, and presumably $p_1 \geq p_2$ (though it doesn't matter). Then we will say that $X \sim \operatorname{2BIN}(\alpha, p_1, p_2, n)$.
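The two-stage story above can be simulated directly with base R. This is only a sketch: the parameter values and the number of simulated reviews are arbitrary choices, not part of the model's definition.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical parameter values, chosen for illustration.
alpha <- 0.5; p1 <- 0.8; p2 <- 0.1; n <- 5
n_reviews <- 10000

# Step 1: does the user like the location? (TRUE with probability alpha)
likes <- runif(n_reviews) < alpha

# Step 2: flip n coins with star probability p1 (liked) or p2 (disliked).
coin_prob <- ifelse(likes, p1, p2)
stars <- rbinom(n_reviews, size = n, prob = coin_prob)

# Relative frequencies over the possible outcomes 0, ..., n stars.
freq <- table(factor(stars, levels = 0:n)) / n_reviews
print(round(freq, 3))
```

Note that zero stars is a possible outcome under this coin-flipping scheme, since all $n$ coins can land blank side up.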

Some work gives the pmf, cdf, expected value and variance of $X$ (where $F(x)$ is the cdf of $X$ and $B(x; n, p)$ is the cdf of a $\operatorname{BIN}(n, p)$ random variable):

$$p(x; \alpha, p_1, p_2, n) = {n \choose x} \left(\alpha p_1^x(1-p_1)^{n - x} + (1 - \alpha) p_2^x(1-p_2)^{n - x}\right) \text{ for } x \in \{0, \ldots, n\}$$

$$F(x) = \alpha B(x; n, p_1) + (1 - \alpha) B(x; n, p_2)$$

$$E[X] = n(\alpha p_1 + (1 - \alpha) p_2)$$

$$\operatorname{Var}(X) = \alpha (1 - \alpha) n^2 (p_1 - p_2)^2 + \alpha n p_1(1 - p_1) + (1 - \alpha) n p_2 (1 - p_2)$$
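For readers who want the "some work" spelled out, here is one way to get the mean and variance. The indicator $Z$ is notation introduced only for this sketch: $Z = 1$ if the user likes the location (probability $\alpha$) and $Z = 0$ otherwise, so that $X \mid Z = 1 \sim \operatorname{BIN}(n, p_1)$ and $X \mid Z = 0 \sim \operatorname{BIN}(n, p_2)$.

By the law of total expectation,

$$E[X] = E[E[X \mid Z]] = \alpha n p_1 + (1 - \alpha) n p_2 = n(\alpha p_1 + (1 - \alpha) p_2)$$

By the law of total variance, $\operatorname{Var}(X) = E[\operatorname{Var}(X \mid Z)] + \operatorname{Var}(E[X \mid Z])$, where

$$E[\operatorname{Var}(X \mid Z)] = \alpha n p_1(1 - p_1) + (1 - \alpha) n p_2 (1 - p_2)$$

and, since $E[X \mid Z]$ takes the value $n p_1$ with probability $\alpha$ and $n p_2$ with probability $1 - \alpha$,

$$\operatorname{Var}(E[X \mid Z]) = \alpha (1 - \alpha) (n p_1 - n p_2)^2 = \alpha (1 - \alpha) n^2 (p_1 - p_2)^2$$

Adding the two pieces gives the variance stated above.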

Below is some R code demonstrating how this random variable works.

library(discreteRV)

# pmf (X is discrete, so this is a mass function); x ranges over 0, ..., N
d2bin <- Vectorize(function(x, alpha, p1, p2, N) {
  choose(N, x) * (alpha * p1^x * (1 - p1)^(N - x) +
                    (1 - alpha) * p2^x * (1 - p2)^(N - x))
}, vectorize.args = "x")

# cdf
p2bin <- Vectorize(function(q, alpha, p1, p2, N) {
  if (q < 0) {return(0)} else if (q > N) {return(1)}
  sum(d2bin(0:floor(q), alpha, p1, p2, N))
}, vectorize.args = "q")

# random draws: sample outcomes 0, ..., N using the pmf (not the cdf) as weights
r2bin <- function(n, alpha, p1, p2, N) {
  sample(0:N, size = n, prob = d2bin(0:N, alpha, p1, p2, N),
         replace = TRUE)
}

# A five-star system
X <- RV(outcomes = 0:5, probs = d2bin, alpha = .5, p1 = .1, p2 = .8, N = 5)
plot(X)

Plot of 2BIN(.5, .1, .8, 5)

# A ten-star system
Y <- RV(outcomes = 0:10, probs = d2bin, alpha = .5, p1 = .2, p2 = .9, N = 10)
plot(Y)

Plot of 2BIN(.5, .2, .9, 10)

The random variable I describe works for an $n$-star rating system. In principle, one could build distributions with more than two modes the same way, by mixing more than two binomial components.
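As a sanity check on the formulas for the pmf, mean, and variance, the sketch below recomputes them numerically using only base R. The function name `dmix2` is hypothetical and not from any package; the parameters match the five-star example, $\operatorname{2BIN}(.5, .1, .8, 5)$.

```r
# pmf of the two-binomial mixture, written with base R's dbinom().
# The name dmix2 is hypothetical and not from any package.
dmix2 <- function(x, alpha, p1, p2, n) {
  alpha * dbinom(x, n, p1) + (1 - alpha) * dbinom(x, n, p2)
}

alpha <- 0.5; p1 <- 0.1; p2 <- 0.8; n <- 5
x <- 0:n
pmf <- dmix2(x, alpha, p1, p2, n)

# Total mass and moments computed directly from the pmf.
total <- sum(pmf)
mean_from_pmf <- sum(x * pmf)
var_from_pmf <- sum((x - mean_from_pmf)^2 * pmf)

# Closed-form expressions from the answer.
mean_formula <- n * (alpha * p1 + (1 - alpha) * p2)
var_formula <- alpha * (1 - alpha) * n^2 * (p1 - p2)^2 +
  alpha * n * p1 * (1 - p1) + (1 - alpha) * n * p2 * (1 - p2)

# Random ratings: sample the outcomes with the pmf as weights.
set.seed(42)  # arbitrary seed
draws <- sample(x, size = 1000, replace = TRUE, prob = pmf)
```

The pmf sums to one over $\{0, \ldots, n\}$, and the moments computed from it agree with the closed forms up to floating-point error.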

Another answer

You could check out the mixture of two normal distributions on Wikipedia: https://en.wikipedia.org/wiki/Multimodal_distribution

Here's an example:

The black data points show how the rate of diagnosis (in cases per 100,000 people) of Hodgkin lymphoma (a kind of cancer) for white females depends on the age of the woman diagnosed. There are two peaks, one at about 20 years, the other at 75 years. There is no single value that can legitimately be called the mode.

The mean and the median would each be about 45 years, but they make no sense at all. What is probably happening is that the disease has two very different causes, one of which occurs more often in young people, the other in old people.

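[Figure: rate of Hodgkin lymphoma diagnosis (cases per 100,000 people) by age for white females, with peaks near ages 20 and 75]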

Data source: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) SEER*Stat Database: Incidence - SEER 9 Regs Limited-Use, Nov 2008 Sub (1973-2006), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, released April 2009, based on the November 2008 submission.
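The mixture of two normal distributions mentioned at the start of this answer can be sketched in a few lines of R. The weights, centres, and spreads below are assumptions chosen to loosely resemble two peaks near 20 and 75; they are not fitted to the SEER data, and `dmixnorm` is a hypothetical name.

```r
# Hypothetical two-component Normal mixture; the parameters are assumed,
# not fitted to the SEER data.
w <- 0.5
mu1 <- 20; sd1 <- 8
mu2 <- 75; sd2 <- 8

# Mixture density: weighted sum of the two component densities.
dmixnorm <- function(x) {
  w * dnorm(x, mu1, sd1) + (1 - w) * dnorm(x, mu2, sd2)
}

# The mean of a mixture is the weighted average of the component means.
mix_mean <- w * mu1 + (1 - w) * mu2

# Density at the mean versus at the two component centres.
density_at_mean <- dmixnorm(mix_mean)
density_at_peaks <- dmixnorm(c(mu1, mu2))

print(c(mean = mix_mean, at_mean = density_at_mean,
        at_peak1 = density_at_peaks[1], at_peak2 = density_at_peaks[2]))
```

Under these assumed parameters the mean sits in the valley between the peaks, where the density is far lower than at either centre, which illustrates why the mean describes no typical case. Calling `curve(dmixnorm, from = -10, to = 105)` would draw the two-humped shape.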