Maximum likelihood estimator of categorical distribution

The task is:

Population of the students has been divided into the following three groups:

  1. Students with the mean of grades below 3.5
  2. Students with the mean of grades between 3.5 and 4.5
  3. Students with the mean of grades above 4.5

Each student in the population is described by a vector of random variables $x= (x^1\ x^2\ x^3)^T$, taking one of three possible states: $(1\ 0\ 0)^T$ if the student belongs to the first group, $(0\ 1\ 0)^T$ if the student belongs to the second group,
and $(0\ 0\ 1)^T$ if the student belongs to the third group. The distribution of $x$ is categorical distribution (also known as generalized Bernoulli distribution or Multinoulli distribution) with parameters $\theta= (\theta_1\ \theta_2\ \theta_3)^T$. From the population of the students N examples were drawn. Calculate the maximum likelihood estimator of $\theta$.

I tried to do it similarly to the Bernoulli case, but I'm stuck. The idea was to find $\theta^*$ by maximizing the likelihood function. So my attempt was

$$ M(x\mid\theta)=\prod_{d=1}^{D} \theta_d^{x_d}=\theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3}\\ \theta^* = \operatorname*{argmax}_\theta M(x\mid\theta) = \operatorname*{argmax}_\theta \ln(M(x\mid\theta))\\ \ln(M(x\mid\theta))= \ln(\theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3}) = x_1\ln\theta_1 + x_2\ln\theta_2 + x_3\ln\theta_3 = x^T (\ln\theta_1\ \ln\theta_2\ \ln\theta_3)^T $$

The next step would be calculating the derivative with respect to $\theta$ and finding its zero, but the function now appears to depend on $(\ln\theta_1\ \ln\theta_2\ \ln\theta_3)^T$ rather than on $\theta$ itself.

I'm not sure where my mistake is. Or perhaps there is no mistake, and it is possible to convert $(\ln\theta_1\ \ln\theta_2\ \ln\theta_3)^T$ to some form with $\theta$?
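For reference, the step implicit in the answers below: over all $N$ drawn examples $x_{(1)}, \dots, x_{(N)}$, the likelihood is the product of the single-sample terms, and collecting equal factors gives, with $N_d$ denoting the number of sampled students in group $d$ (notation mine),

```latex
L(\theta) = \prod_{i=1}^{N} \prod_{d=1}^{3} \theta_d^{x_{(i)d}}
          = \prod_{d=1}^{3} \theta_d^{N_d},
\qquad
\ln L(\theta) = \sum_{d=1}^{3} N_d \ln \theta_d,
\qquad
N_1 + N_2 + N_3 = N .
```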

There are 2 answers below.

BEST ANSWER

Since $(\theta_1,\theta_2,\theta_3)$ must satisfy the constraint $$\theta_1+\theta_2+\theta_3 = 1,\tag 0$$ one way to do this is by Lagrange multipliers. You have $$ \operatorname{grad} (\theta_1+\theta_2+\theta_3) = (1,1,1) \tag 1 $$ and $$ \operatorname{grad} (x_1\log\theta_1 + x_2\log\theta_2 + x_3\log\theta_3) = \left( \frac{x_1}{\theta_1}, \frac{x_2}{\theta_2}, \frac{x_3}{\theta_3} \right). \tag 2 $$ So you want a value of $(\theta_1,\theta_2,\theta_3)$ for which $(2)$ is a scalar multiple of $(1).$ That happens only if the ratio $\theta_1:\theta_2:\theta_3$ is equal to the ratio $x_1:x_2:x_3.$ But the constraint $(0)$ must also hold. Consequently you get $$ \theta_1 = \frac{x_1}{x_1+x_2+x_3} $$ and similarly for the other two values of the subscript. (Here, since $N$ students were drawn, read $x_j$ as the total count of students observed in group $j$, so that the log-likelihood of the whole sample has exactly this form; then $x_1+x_2+x_3 = N$ and $\theta_j = x_j/N.$)
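As a quick numerical sanity check (my own addition, with made-up counts), a brute-force grid search over the probability simplex confirms that the count proportions maximize the constrained log-likelihood:

```python
import numpy as np
from itertools import product

# Hypothetical counts for the three groups out of N = 20 drawn students
counts = np.array([7, 9, 4])             # (x_1, x_2, x_3), illustrative only
N = counts.sum()

def log_lik(theta):
    """Log-likelihood sum_d x_d * ln(theta_d) of the observed counts."""
    return float(np.sum(counts * np.log(theta)))

theta_mle = counts / N                   # closed-form MLE: theta_d = x_d / N

# Brute-force search over a grid on the simplex {theta > 0, sum(theta) = 1}
best_ll, best_theta = -np.inf, None
for t1, t2 in product(np.arange(0.01, 0.99, 0.01), repeat=2):
    t3 = 1.0 - t1 - t2
    if t3 <= 0:
        continue
    ll = log_lik(np.array([t1, t2, t3]))
    if ll > best_ll:
        best_ll, best_theta = ll, (t1, t2, t3)

# The grid maximizer lands on (0.35, 0.45, 0.20) = counts / N
print(theta_mle, best_theta)
```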


The accepted answer does not derive the solution from the MLE approach in full rigor, so for completeness' sake I will write out the full path to the solutions $$\theta_1 = \frac{1}{n} \sum_{i=1}^n x_{i1} \\ \theta_2 = \frac{1}{n} \sum_{i=1}^n x_{i2}$$ ($\theta_3 = 1 - \theta_1 - \theta_2$ needs no separate estimate) in the following, without the use of a Lagrange multiplier:

Assume $X_1, \dots, X_n \overset{iid}\sim M(\theta_1, \theta_2)$ with $X_i = (X_{i1}, X_{i2}, 1 - X_{i1} - X_{i2})^T$.
The tricky part is to see that each $X_i$ effectively consists of only two free random variables (consequently $M$ also has only two parameters, i.e. we can ignore $\theta_3$).
The log-likelihood function of the observations $x_1, \dots, x_n$ is $$l(\theta) = \ln P(X_1 = x_1, \dots, X_n = x_n; \theta_1, \theta_2) \\ \overset{iid} = \ln \prod_{i=1}^n P(X_i = x_i; \theta_1, \theta_2) \\ = \sum_{i=1}^n \ln P(X_i = x_i; \theta_1, \theta_2) \\ = \sum_{i=1}^n x_{i1} \ln \theta_1 + x_{i2} \ln \theta_2 + (1 - x_{i1} - x_{i2}) \ln (1 - \theta_1 - \theta_2).$$ Then the score function is $$ s(\theta) = \nabla_\theta l(\theta) = \left(\sum_{i=1}^n \left(\frac{x_{i1}}{\theta_1} - \frac{1 - x_{i1} - x_{i2}}{1 - \theta_1 - \theta_2}\right), \sum_{i=1}^n \left(\frac{x_{i2}}{\theta_2} - \frac{1 - x_{i1} - x_{i2}}{1 - \theta_1 - \theta_2}\right)\right)^T.$$ Setting the first component to zero gives $$\sum_{i=1}^n \frac{x_{i1}}{\theta_1} = \sum_{i=1}^n \frac{1 - x_{i1} - x_{i2}}{1 - \theta_1 - \theta_2},$$ and clearing the denominators, $$ \sum_{i=1}^n (x_{i1} - \theta_1 x_{i1} - \theta_2 x_{i1}) = \sum_{i=1}^n (\theta_1 - \theta_1 x_{i1} - \theta_1 x_{i2}),$$ in which the $\theta_1 x_{i1}$ terms cancel, so $$\theta_1 = \frac{\sum_{i=1}^n (x_{i1} - \theta_2 x_{i1})}{\sum_{i=1}^n (1 - x_{i2})}. $$ The second component gives the analogous $$\theta_2 = \frac{\sum_{j=1}^n (x_{j2} - \theta_1 x_{j2})}{\sum_{i=1}^n (1 - x_{i1})},$$ and inserting $\theta_1$ yields $$\theta_2 = \frac{\sum_{j=1}^n \left(x_{j2} - \frac{\sum_{i=1}^n (x_{i1} - \theta_2 x_{i1})}{\sum_{i=1}^n (1 - x_{i2})} x_{j2}\right)}{\sum_{i=1}^n (1 - x_{i1})} \\ = \frac{\sum_{j=1}^n x_{j2}}{\sum_{i=1}^n (1 - x_{i1})} - \frac{ \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{\sum_{i=1}^n (1 - x_{i1}) \sum_{j=1}^n (1 - x_{j2})} + \frac{\theta_2 \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{\sum_{i=1}^n (1 - x_{i1}) \sum_{j=1}^n (1 - x_{j2})}.$$ Subtracting the last fraction from both sides gives $$\theta_2 \frac{\sum_{k=1}^n (1 - x_{k1}) \sum_{l=1}^n (1 - x_{l2}) - \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{\sum_{k=1}^n (1 - x_{k1}) \sum_{l=1}^n (1 - x_{l2})} = \frac{\sum_{j=1}^n x_{j2}}{\sum_{k=1}^n (1 - x_{k1})} - \frac{\sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{\sum_{k=1}^n (1 - x_{k1}) \sum_{l=1}^n (1 - x_{l2})},$$ and dividing by the fraction on the left-hand side gives $$\theta_2 = \frac{\sum_{j=1}^n x_{j2} \sum_{l=1}^n(1-x_{l2})}{ \sum_{k=1}^n (1-x_{k1}) \sum_{l=1}^n (1-x_{l2}) - \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}} - \frac{\sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{ \sum_{k=1}^n (1-x_{k1}) \sum_{l=1}^n (1-x_{l2}) - \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}} \\ = \frac{\sum_{j=1}^n x_{j2} \sum_{l=1}^n(1-x_{l2}) - \sum_{i=1}^n \sum_{j=1}^n x_{i1} x_{j2}}{n^2 - n \sum_{k=1}^n x_{k1} - n \sum_{l=1}^n x_{l2}} \\ = \frac{\sum_{j=1}^n x_{j2} \left(n - \sum_{l=1}^n x_{l2} - \sum_{i=1}^n x_{i1}\right)}{n \left(n - \sum_{k=1}^n x_{k1} - \sum_{l=1}^n x_{l2}\right)} \\ = \frac{1}{n} \sum_{j=1}^n x_{j2} $$

and inserting back into $\theta_1$ gives $$\theta_1 = \frac{\sum_{i=1}^n (x_{i1} - \frac{1}{n} \sum_{j=1}^n x_{j2}x_{i1})}{\sum_{i=1}^n (1 - x_{i2})} = \frac{\sum_{i=1}^n x_{i1} (1 - \frac{1}{n} \sum_{j=1}^n x_{j2})}{n(1 - \frac{1}{n} \sum_{i=1}^n x_{i2})} = \frac{1}{n} \sum_{i=1}^n x_{i1}.$$
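To double-check the algebra numerically (my own addition, on simulated data with assumed true parameters $(0.5, 0.3, 0.2)$), the fixed-point expressions derived above do reduce to the sample means:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_theta = [0.5, 0.3, 0.2]                 # assumed ground truth, simulation only
groups = rng.choice(3, size=n, p=true_theta)
X = np.eye(3)[groups]                        # rows are one-hot (x_i1, x_i2, x_i3)

S1, S2 = X[:, 0].sum(), X[:, 1].sum()

# theta_2 from the final ratio derived above: S2 (n - S1 - S2) / (n (n - S1 - S2))
theta2 = S2 * (n - S1 - S2) / (n * (n - S1 - S2))
assert np.isclose(theta2, X[:, 1].mean())    # equals the sample mean of x_{i2}

# Back-substitute into theta_1 = sum_i (x_i1 - theta2 x_i1) / sum_i (1 - x_i2)
theta1 = (S1 - theta2 * S1) / (n - S2)
assert np.isclose(theta1, X[:, 0].mean())    # equals the sample mean of x_{i1}

print(theta1, theta2)                        # both close to 0.5 and 0.3
```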