Maximum likelihood estimator of the Linear Discriminant Analysis covariance matrix

I'm trying to derive the maximum likelihood estimator of the covariance matrix in linear discriminant analysis, but I don't understand why my result does not match the one in "The Elements of Statistical Learning" by Hastie.

For $p$ predictors the likelihood is:

\begin{equation} L(\pi_1,\ldots,\pi_K,\mu_1,\ldots,\mu_K,\Sigma)=\prod_{k=1}^K \pi_k^{N_k}\prod_{i=1,g_i=k}^N f(x_i\mid\mu_k,\Sigma) \end{equation} So the log-likelihood is: \begin{equation} \log L(\pi_1,\ldots,\pi_K,\mu_1,\ldots,\mu_K,\Sigma)=\sum_{k=1}^K N_k\log(\pi_k)+\sum_{i=1,g_i=k}^N \log\left( \left(\frac{1}{\sqrt{2\pi}}\right)^{p}\frac{1}{|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k)\right)\right) \end{equation}

\begin{equation} =\sum_{k=1}^K N_k\log(\pi_k)-\frac{Np}{2}\log(2\pi)-\frac{N}{2}\log(|\Sigma|)-\sum_{i=1,g_i=k}^N \frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k) \end{equation}
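As a quick numerical sanity check on the expanded form (a minimal sketch of my own, not from the book; the simulated data, seed, and dimensions are made up for illustration), the direct and expanded versions agree:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p, K, N = 2, 3, 300
g = rng.integers(0, K, size=N)                 # class labels g_i
mus = rng.normal(size=(K, p))                  # class means mu_k
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])     # shared covariance
X = mus[g] + rng.multivariate_normal(np.zeros(p), Sigma, size=N)

Nk = np.bincount(g, minlength=K)               # class counts N_k
pi = Nk / N                                    # plug-in priors pi_k

# Direct form: sum_k N_k log(pi_k) + sum_i log f(x_i | mu_{g_i}, Sigma)
direct = (Nk * np.log(pi)).sum() + sum(
    multivariate_normal.logpdf(X[i], mus[g[i]], Sigma) for i in range(N)
)

# Expanded form from the equation above
Sinv = np.linalg.inv(Sigma)
d = X - mus[g]
quad = np.einsum("ij,jk,ik->", d, Sinv, d)     # sum of the quadratic forms
expanded = ((Nk * np.log(pi)).sum()
            - N * p / 2 * np.log(2 * np.pi)
            - N / 2 * np.log(np.linalg.det(Sigma))
            - 0.5 * quad)

print(np.isclose(direct, expanded))            # True
```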

Keeping only the terms that depend on $\Sigma$, we need

\begin{equation} \frac{\partial}{\partial \Sigma}\left( -\frac{N}{2}\log(|\Sigma|) -\sum_{i=1,g_i=k}^N\frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k)\right) \end{equation}

Since each quadratic form is a scalar, it equals its own trace, and the cyclic property of the trace gives $(x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k)=\operatorname{tr}\left(\Sigma^{-1}(x_i-\mu_k)(x_i-\mu_k)^T\right)$, so this equals \begin{equation} \frac{\partial}{\partial \Sigma}\left( -\frac{N}{2}\log(|\Sigma|)-\sum_{i=1,g_i=k}^N\frac{1}{2}\operatorname{tr}\left(\Sigma^{-1}(x_i-\mu_k)(x_i-\mu_k)^T\right)\right) \end{equation} Differentiating and setting the result to zero, we get: \begin{equation} -\frac{N}{2}\Sigma^{-1} + \sum_{i=1,g_i=k}^N \frac{1}{2}\Sigma^{-1}(x_i-\mu_k)(x_i-\mu_k)^T\Sigma^{-1}=0 \end{equation}
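The differentiation step uses two standard matrix-calculus identities, stated here for symmetric $\Sigma$ and symmetric $A=(x_i-\mu_k)(x_i-\mu_k)^T$ (I'm taking these as given; they appear in standard references such as the Matrix Cookbook):

\begin{equation} \frac{\partial}{\partial \Sigma}\log(|\Sigma|)=\Sigma^{-1}, \qquad \frac{\partial}{\partial \Sigma}\operatorname{tr}\left(\Sigma^{-1}A\right)=-\Sigma^{-1}A\Sigma^{-1} \end{equation}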

Pre- and post-multiplying by $\Sigma$: \begin{equation} -\frac{N}{2}\Sigma + \sum_{i=1,g_i=k}^N\frac{1}{2} (x_i-\mu_k)(x_i-\mu_k)^T=0 \end{equation}

So the maximum likelihood estimator of $\Sigma$ is: \begin{equation} \hat{\Sigma} = \frac{1}{N}\sum_{i=1,g_i=k}^N(x_i-\mu_k)(x_i-\mu_k)^T \end{equation}

Yet Hastie gives the estimator as

\begin{equation} \hat{\Sigma} = \frac{1}{N}\sum^K_{k=1}\sum_{i=1,g_i=k}^N(x_i-\mu_k)(x_i-\mu_k)^T \end{equation}

What am I missing?
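To make the discrepancy concrete, here is a minimal numpy sketch (my own, not from the book; the simulated data are purely illustrative) evaluating both formulas on the same sample: my sum over a single fixed class $k$ versus the pooled sum over all $K$ classes. They give clearly different matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
p, K, N = 2, 3, 300
g = rng.integers(0, K, size=N)                  # class labels g_i
mus_true = rng.normal(size=(K, p)) * 3
X = mus_true[g] + rng.normal(size=(N, p))       # shared identity covariance

# Plug-in class means mu_k
mu_hat = np.stack([X[g == k].mean(axis=0) for k in range(K)])

# My result: sum over observations in a single fixed class k, divided by N
k = 0
d_k = X[g == k] - mu_hat[k]
Sigma_mine = d_k.T @ d_k / N

# Hastie's estimator: pooled sum over all K classes, divided by N
d_all = X - mu_hat[g]
Sigma_pooled = d_all.T @ d_all / N

print(np.round(Sigma_mine, 3))    # roughly identity / K in scale
print(np.round(Sigma_pooled, 3))  # close to the identity
```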