TL;DR: How would you show that a histogram for observations of a discrete random variable is its maximum-likelihood non-parametric estimate?
In more detail: I would like to obtain a maximum-likelihood estimate of the (non-parametric) distribution of a discrete random variable $X \in [1\ldots K]$, from a set of $N$ i.i.d. observations $\underline x$.
I.e.
$$\hat{\underline\theta} = \operatorname*{argmax}_{\underline\theta} \Pr(\underline x;\underline\theta), \text{ where } \Pr(X=k; \underline\theta) = \theta_k, \quad k \in [1\ldots K]$$
I would expect the solution to be: $\hat{\theta_k}=\frac{n_k}{N}$, i.e. the fraction of occurrences of $k$ in $N$ observations. However, I failed to get there.
I tried to maximize $\Pr(\underline x\mid \underline\theta)=\prod_{k=1}^K \theta_k^{n_k}$, where $\theta_K=1-\sum_{k=1}^{K-1}\theta_k$ but that led me to a different solution.
Any idea what I am missing?
Lagrange multipliers can handle this.
For $k=1,\ldots,K$ you have $\Pr(X=k) = \theta_k,$ and you observe $n$ i.i.d. copies of $X,$ called $X_1,\ldots,X_n.$
The vector $(\theta_1,\ldots,\theta_K)$ is subject to the constraints $\theta_1\ge0,\ldots,\theta_K\ge0$ and $\theta_1+\cdots+\theta_K=1,$ and no others.
You observe $X_i=x_i\in\{1,\ldots,K\}$ for $i=1,\ldots,n.$ Then you have $$ L(\theta_1,\ldots,\theta_K) = \prod_{i=1}^n \Pr(X_i=x_i) = \prod_{i=1}^n \theta_{x_i} = \prod_{k=1}^K \theta_k^{n_k} $$ where $$ n_k = \text{how many of $x_1,\ldots,x_n$ are equal to $k$.} $$ You know that $$ n_1+\cdots+n_K = n. $$ Given $x_1,\ldots,x_n,$ the problem is to find the value $(\widehat\theta_1,\ldots,\widehat\theta_K)$ of $(\theta_1,\ldots,\theta_K)$ that maximizes $L$ subject to the constraints above.
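For concreteness, here is a minimal numeric sketch of this setup (the data are made up, and `K`, `n`, and the random seed are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical data: n i.i.d. observations of a K-valued variable.
rng = np.random.default_rng(1)
K, n = 4, 20
x = rng.integers(1, K + 1, size=n)  # x_1, ..., x_n, each in {1, ..., K}

# n_k = how many of x_1, ..., x_n are equal to k; the counts sum to n.
n_k = np.bincount(x, minlength=K + 1)[1:]
assert n_k.sum() == n

# The likelihood L(theta) = prod_k theta_k^{n_k}, evaluated at one arbitrary
# point of the probability simplex, here the uniform distribution.
theta = np.full(K, 1.0 / K)
L = np.prod(theta ** n_k)
print(n_k, L)
```

At the uniform point the product collapses to $(1/K)^n$ regardless of the counts, which makes a handy check that the counting is right.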
We would like to show $\widehat\theta_k = \dfrac{n_k} n.$
We have $$ \ell = \log L = \sum_{k=1}^K n_k\log\theta_k, $$ so $$ \frac{\partial\ell}{\partial\theta_k} = \frac{n_k}{\theta_k}, $$ and the gradient is $$ \nabla\ell = \left( \frac{n_1}{\theta_1},\ldots,\frac{n_K}{\theta_K} \right). $$
We also have the constraint $\theta_1+\cdots+\theta_K = 1.$ Since $$ \frac \partial {\partial\theta_k} (\theta_1+\cdots+\theta_K) = 1, $$ the gradient of the constraint function is $(1,\ldots,1).$
You need a value of $(\theta_1,\ldots,\theta_K)$ that satisfies the constraint on the sum and for which the two gradients are scalar multiples of each other. The only way the gradient of $\ell$ can be parallel to $(1,\ldots,1)$ is for the values $n_i/\theta_i$ all to be equal to each other, so $\theta_i = n_i\cdot c$ for some $c$ that doesn't depend on $i\in\{1,\ldots,K\}.$ And $c$ must be chosen so that the sum of the $\theta$s is $1$: summing $\theta_i = n_i c$ over $i$ gives $1 = c(n_1+\cdots+n_K) = cn,$ so $c = 1/n$ and therefore $\widehat\theta_i = n_i/n,$ as desired.
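As a sanity check (not part of the proof), one can verify numerically that the histogram $\widehat\theta_k = n_k/n$ attains a log-likelihood at least as large as any other point of the simplex. The data below are hypothetical, and sampling competitors from a uniform Dirichlet is just one convenient way to cover the simplex:

```python
import numpy as np

# Hypothetical data: n i.i.d. observations of a K-valued variable.
rng = np.random.default_rng(0)
K, n = 4, 1000
x = rng.integers(1, K + 1, size=n)

# Histogram estimate: theta_hat_k = n_k / n.
n_k = np.bincount(x, minlength=K + 1)[1:]
theta_hat = n_k / n

def log_likelihood(theta):
    # ell(theta) = sum_k n_k * log(theta_k)
    return float(np.sum(n_k * np.log(theta)))

# No randomly drawn point of the simplex should beat theta_hat.
best = log_likelihood(theta_hat)
for _ in range(10_000):
    theta = rng.dirichlet(np.ones(K))
    assert log_likelihood(theta) <= best + 1e-9
print("histogram maximizes the log-likelihood over all sampled points")
```

This only samples the simplex, so it is evidence rather than proof, but it is a quick way to catch an algebra mistake in a derivation like the one in the question.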