Arriving at Maximum Likelihood Estimates


I am trying to develop a text classifier and I'm reading about MLE to help me understand the process. I came across an example (originally shown as an image, not reproduced here)

and I wanted to try this myself. I'm running into a problem, so here is my approach: I wish to find the MLE for $\tau$, $\mu_j$, and $\sigma_j$. My first step was to write down the likelihood: $$f =\prod_{i=1}^{n}P(x_i, y_i) = \prod_{i=1}^n\left[P(y_i)P(x_i|y_i) \right]=\prod_{i=1}^n\left[\tau[y_i]\prod_{j=1}^d\left[ \dfrac{1}{\sqrt{2\pi}\sigma_j}e^{-(x_i[j]-\mu_j)^2/(2\sigma_j^2)} \right] \right]$$ Taking the log gives $$\log f=\sum_{i=1}^n \log\tau [y_i]+\sum_{i=1}^n\sum_{j=1}^d\left[\log\left(\dfrac{1}{\sqrt{2\pi}} \right)+\log\left(\dfrac{1}{\sigma_j} \right) - \dfrac{(x_i[j]-\mu_j)^2}{2\sigma_j^2} \right]$$
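As a sanity check on the derivation, here is a minimal sketch (in Python/NumPy, with hypothetical variable names: `x` is an $n \times d$ array, `y` holds 0-indexed class labels, and `tau`, `mu`, `sigma` are the parameter vectors) that evaluates $\log f$ directly from the formula above:

```python
import numpy as np

def log_likelihood(x, y, tau, mu, sigma):
    """Evaluate log f from the formula above.

    x     : (n, d) array of feature vectors x_i
    y     : (n,) array of 0-indexed class labels y_i
    tau   : (k,) array of class probabilities
    mu    : (d,) array of per-feature means
    sigma : (d,) array of per-feature standard deviations
    """
    # First term: sum_i log tau[y_i]
    class_term = np.sum(np.log(tau[y]))
    # Second term: sum over i and j of the Gaussian log-densities
    gauss_term = np.sum(
        -0.5 * np.log(2 * np.pi)
        - np.log(sigma)
        - (x - mu) ** 2 / (2 * sigma ** 2)
    )
    return class_term + gauss_term
```

Evaluating this on a small data set and comparing against a brute-force double sum over $i$ and $j$ is a quick way to confirm the algebra before differentiating.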

I use a Lagrange multiplier for the constraint $\sum_{i=1}^k \tau[i] = 1$:

$$L =\sum_{i=1}^n \log\tau [y_i]+\sum_{i=1}^n\sum_{j=1}^d\left[\log\left(\dfrac{1}{\sqrt{2\pi}} \right)+\log\left(\dfrac{1}{\sigma_j} \right) - \dfrac{(x_i[j]-\mu_j)^2}{2\sigma_j^2} \right] - \lambda\left(\sum_{i=1}^k \tau[i] - 1 \right)$$

This is where I am stuck. From what I understand, if I want to find an estimate for $\tau$, I need to determine and solve: $$\dfrac{\partial L}{\partial\tau[i]} = 0$$

I don't really know what to do here, since $\tau$ appears with different indices ($\tau[y_i]$ in the log-likelihood but $\tau[i]$ in the constraint). Any ideas on how to head in the right direction?

Best answer:

Since the log-likelihood in your expression splits into one term involving only $\tau$ and another involving only the $\mu_j$ and $\sigma_j$, you can treat the two MLE problems separately. For the $\tau$ problem, simplify the Lagrangian by dropping the parts coming from the sampling density of the $x$ values. Let $n[j] \equiv \sum_{i=1}^n \mathbb{I}(y_i=j)$ be the number of sample values $y_i$ equal to $j$. Then the Lagrangian for the MLE of $\tau$ can be written as:

$$\begin{equation} \begin{aligned} \mathscr{L}_y(\tau, \lambda) &= \sum_{i=1}^n \log \tau[{y_i}] - \lambda \Big( \sum_{j=1}^k \tau[j] - 1 \Big) \\[6pt] &= \sum_{j=1}^k n[j] \log \tau[j] - \lambda \Big( \sum_{j=1}^k \tau[j] - 1 \Big) \\[6pt] &= \sum_{j=1}^k \Big( n[j] \log \tau[j] - \lambda \tau[j] \Big) + \lambda. \\[6pt] \end{aligned} \end{equation}$$
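The grouping step in the second line (collecting the $n$ per-observation terms into $k$ per-class terms) can be checked numerically; here is a small sketch with made-up data, the array names being purely illustrative:

```python
import numpy as np

# Made-up example: n = 6 observations over k = 3 classes (0-indexed).
y = np.array([0, 2, 1, 0, 0, 2])
tau = np.array([0.5, 0.2, 0.3])

n_counts = np.bincount(y, minlength=3)        # n[j] = number of y_i equal to j

per_observation = np.sum(np.log(tau[y]))      # sum_i log tau[y_i]
per_class = np.sum(n_counts * np.log(tau))    # sum_j n[j] log tau[j]
print(np.isclose(per_observation, per_class)) # True
```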

Differentiating with respect to the vector $\tau$ gives the gradient and Hessian of the Lagrangian function:

$$\begin{equation} \begin{aligned} \nabla_\tau \mathscr{L}_y(\tau, \lambda) &= \begin{bmatrix} n[1] / \tau[1] - \lambda \\ \vdots \\ n[k] / \tau[k] - \lambda \\ \end{bmatrix} \quad \nabla_\tau^2 \mathscr{L}_y(\tau, \lambda) = - \begin{bmatrix} n[1] / \tau[1]^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & n[k] / \tau[k]^2 \\ \end{bmatrix}. \end{aligned} \end{equation}$$

The Hessian is negative definite (assuming every $n[j] > 0$), so the Lagrangian is strictly concave in $\tau$. The maximum therefore occurs at the unique critical point $\nabla_\tau \mathscr{L}_y(\hat{\tau}, \hat{\lambda}) = 0$, which gives:

$$\begin{equation} \begin{aligned} \frac{n[1]}{\hat{\tau}[1]} = \frac{n[2]}{\hat{\tau}[2]} = \cdots = \frac{n[k]}{\hat{\tau}[k]} = \hat{\lambda}. \end{aligned} \end{equation}$$

Each of these equations gives $\hat{\tau}[j] = n[j] / \hat{\lambda}$. Summing over $j$ and using the constraint $\sum_{j=1}^k \hat{\tau}[j] = 1$ together with the fact that $\sum_{j=1}^k n[j] = n$ gives $\hat{\lambda} = n$, and hence the MLE:

$$\hat{\tau}[1] = \frac{n[1]}{n} \quad \quad \quad \hat{\tau}[2] = \frac{n[2]}{n} \quad \quad \quad \cdots \quad \quad \quad \hat{\tau}[k] = \frac{n[k]}{n}.$$

The MLE is just the vector of sample proportions of the outcomes $y_1,\ldots,y_n$, which is an unsurprising result: this is the form the MLE takes whenever you fit a categorical distribution.
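To make the result concrete, here is a small sketch (with made-up data and hypothetical variable names) that computes $\hat{\tau}$ as the sample proportions; it also includes the usual Gaussian MLEs for $\mu_j$ and $\sigma_j$ from the separate problem mentioned at the start of the answer, which is not derived above:

```python
import numpy as np

# Made-up example: n = 8 labels over k = 3 classes, plus an (n, d) feature matrix.
y = np.array([0, 1, 1, 2, 0, 0, 1, 0])
x = np.random.default_rng(0).normal(size=(8, 4))
k = 3

# MLE of tau: the vector of sample proportions n[j] / n.
n_counts = np.bincount(y, minlength=k)
tau_hat = n_counts / len(y)                   # [0.5, 0.375, 0.125]

# For the separate Gaussian problem (not derived above), the standard MLEs
# are the per-feature sample mean and the biased sample standard deviation.
mu_hat = x.mean(axis=0)
sigma_hat = x.std(axis=0)                     # divides by n, not n - 1

print(tau_hat, mu_hat, sigma_hat)
```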