I'm trying to derive the maximum likelihood estimator for linear discriminant analysis, but I don't understand why my result does not match the one in "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman.
For $p$ predictors the likelihood is:
\begin{equation} L(\pi_1,\dots,\pi_K,\mu_1,\dots,\mu_K,\Sigma)=\prod_{k=1}^K \pi_k^{N_k}\prod_{i=1,g_i=k}^N f(x_i\mid\mu_k,\Sigma), \end{equation} where $N_k$ is the number of observations in class $k$ and $N=\sum_{k=1}^K N_k$. So the log-likelihood is: \begin{equation} \log L(\pi_1,\dots,\pi_K,\mu_1,\dots,\mu_K,\Sigma)=\sum_{k=1}^K N_k\log(\pi_k)+\sum_{i=1,g_i=k}^N \log\left( \left(\frac{1}{\sqrt{2\pi}}\right)^{p}\frac{1}{|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k)\right)\right) \end{equation}
\begin{equation} =\sum_{k=1}^K N_k\log(\pi_k)-\frac{Np}{2}\log(2\pi)-\frac{N}{2}\log(|\Sigma|)-\sum_{i=1,g_i=k}^N \frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k) \end{equation}
To maximize over $\Sigma$, I take the derivative of the terms that involve $\Sigma$: \begin{equation} \frac{\partial}{\partial \Sigma}\left( -\frac{N}{2}\log(|\Sigma|) -\sum_{i=1,g_i=k}^N\frac{1}{2}(x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k)\right) \end{equation}
\begin{equation} =\frac{\partial}{\partial \Sigma}\left( -\frac{N}{2}\log(|\Sigma|)-\sum_{i=1,g_i=k}^N\frac{1}{2}\operatorname{tr}\left((x_i-\mu_k)^T\Sigma^{-1}(x_i-\mu_k)\right)\right) \end{equation} Differentiating, using $\frac{\partial}{\partial \Sigma}\log|\Sigma|=\Sigma^{-1}$ and $\frac{\partial}{\partial \Sigma}\operatorname{tr}(\Sigma^{-1}A)=-\Sigma^{-1}A\Sigma^{-1}$, and setting to zero we get: \begin{equation} -\frac{N}{2}\Sigma^{-1} + \sum_{i=1,g_i=k}^N \frac{1}{2} \Sigma^{-1}(x_i-\mu_k)(x_i-\mu_k)^T\Sigma^{-1}=0 \end{equation}
Pre- and post-multiplying by $\Sigma$: \begin{equation} -\frac{N}{2}\Sigma + \sum_{i=1,g_i=k}^N\frac{1}{2} (x_i-\mu_k)(x_i-\mu_k)^T=0 \end{equation}
So the maximum likelihood estimator for $\Sigma$ is: \begin{equation} \Sigma = \frac{1}{N}\sum_{i=1,g_i=k}^N(x_i-\mu_k)(x_i-\mu_k)^T \end{equation}
Yet Hastie gives the estimator as
\begin{equation} \Sigma = \frac{1}{N}\sum^K_{k=1}\sum_{i=1,g_i=k}^N(x_i-\mu_k)(x_i-\mu_k)^T \end{equation}
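To make sure I'm reading Hastie's formula correctly, here is how I would compute it numerically: the pooled sum of within-class outer products over all $K$ classes, divided by the total $N$. This is only a sketch on synthetic data of my own (the variable names `X`, `g`, `Sigma_hat` are mine, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations, p predictors, K classes
N, p, K = 300, 2, 3
g = rng.integers(0, K, size=N)            # class labels g_i
X = rng.normal(size=(N, p)) + g[:, None]  # class-dependent means, common covariance

# Hastie's estimator: (1/N) * sum_k sum_{i: g_i=k} (x_i - mu_k)(x_i - mu_k)^T
Sigma_hat = np.zeros((p, p))
for k in range(K):
    Xk = X[g == k]
    mu_k = Xk.mean(axis=0)    # MLE of the class-k mean
    D = Xk - mu_k
    Sigma_hat += D.T @ D      # sum of outer products within class k
Sigma_hat /= N                # divide by the total N, not N_k

print(Sigma_hat)
```

Note the outer loop over all $K$ classes: every class contributes its centered outer products, and the normalization is by the overall sample size $N$.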
What am I missing?