I'm trying to calculate the maximum likelihood estimators of the prior probabilities of the following likelihood function:
$L(p_1,\dots,p_K,\mu_1,\dots,\mu_K,\Sigma)=\prod_{k=1}^K p_k^{N_k}\prod_{k=1}^K\prod_{i\in C_k} f(x_i\mid\mu_k,\Sigma)$, where $C_k$ is the index set of the $N_k$ observations in class $k$,
taking the logarithm:
$\log L(p_1,\dots,p_K,\mu_1,\dots,\mu_K,\Sigma)=\sum_{k=1}^K N_k\log(p_k)+\sum_{k=1}^K\sum_{i\in C_k}\log f(x_i\mid\mu_k,\Sigma)$
using the fact that $p_i=1-\sum_{j\neq i} p_j$, differentiating with respect to $p_k$, and setting the derivative to zero, we get:
$\frac{N_k}{p_k}-\sum_{i\neq k}\frac{N_i}{p_i}=0$
I know that the final answer should be $p_k=\frac{N_k}{N}$, and this is easy to check; however, I don't see how the second term in the above equation can be shown to equal $N$.
When you're new to functions of probability measures, I'd propose parametrizing $\mathcal P=\{p\in[0,1]^K:\|p\|_1=1\}$ by a subset of $\mathbb R^{K-1}$. Formally, we consider $p:\{x\in[0,1]^{K-1}:\|x\|_1\le 1\}\rightarrow\mathcal P$, $x\mapsto (x_1,\dots,x_{K-1},1-\|x\|_1)$. So $p_K$ is not a free variable of the function: it is a function $p_K(x)=1-\|x\|_1$, and its partial derivatives are simply $\frac{\partial p_K}{\partial x_k}=-1$ for $k=1,\dots,K-1$ (we only have $K-1$ free variables). For the others we simply have $p_k(x)=x_k$, the identity.

When taking your derivative, you have to apply the chain rule. This means, for $f=f(x,\mu,\Sigma)=\log(L(p(x),\mu,\Sigma))$, we get $$\frac{\partial f}{\partial x_k}=\frac{\partial N_k\log(x_k)}{\partial x_k}+\frac{\partial N_K\log(p_K(x))}{\partial x_k}=\frac{N_k}{x_k}+\frac{N_K}{p_K(x)}\cdot(-1)=\frac{N_k}{p_k}-\frac{N_K}{p_K},$$ with the "shorthand" $p_k=p_k(x)$ and $p_K=p_K(x)$.

If you now set all $K-1$ derivatives to $0$, you get $\frac{N_k}{p_k}=\frac{N_K}{p_K}$ for all $k=1,\dots,K-1$, meaning that these ratios are all equal (since the right-hand side is the same for every $k$). We can also write $p_k=\frac{N_k}{N_K}p_K$, and now normalization gives $1=\sum_kp_k=\sum_k\frac{N_k}{N_K}p_K$, so $p_K=\frac{N_K}{N}$ with $N=\sum_kN_k$. Plugging this back in gives $p_k=\frac{N_k}{N_K}p_K=\frac{N_k}{N}$.
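If you want to convince yourself numerically, here is a quick sanity check (a sketch with made-up counts $N_k$, not part of the derivation): it evaluates the prior part of the log-likelihood, $\sum_k N_k\log p_k$, at the candidate $\hat p_k=N_k/N$ and at many random points of the simplex, and confirms that $\hat p$ is never beaten.

```python
import numpy as np

# Hypothetical class counts N_k (any positive integers work)
N = np.array([30, 50, 20])
p_hat = N / N.sum()  # candidate MLE: p_k = N_k / N

def log_lik(p):
    # Prior part of the log-likelihood: sum_k N_k * log(p_k)
    return np.sum(N * np.log(p))

rng = np.random.default_rng(0)
best = log_lik(p_hat)

# Random points on the probability simplex never exceed log_lik(p_hat)
for _ in range(10_000):
    q = rng.dirichlet(np.ones(len(N)))
    assert log_lik(q) <= best + 1e-12

print("p_hat =", p_hat, "maximizes the prior log-likelihood")
```

That $\hat p$ wins against every sampled $q$ is exactly Gibbs' inequality, $\sum_k N_k\log q_k\le\sum_k N_k\log\frac{N_k}{N}$, with equality only at $q=\hat p$.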
Just in case you're interested: if you have done this a few times, or if you prefer, just remember that $\mathcal P\subseteq\mathbb R^K$ lies in an affine subspace whose direction space is spanned by the differences $e_i-e_j$ of unit vectors, equivalently given by all vectors $v\in\mathbb R^K$ with $\sum_kv_k=0$ (orthogonal to $(1,\dots,1)$). Thus, when looking for stationary points, you have to consider the directional derivatives along a basis of this direction space (we took $e_k-e_K$ in the above), and you can use that at a stationary point the derivative vanishes along the entire direction space.
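Written out, the directional-derivative version is the same computation as above in one line (no new assumptions, just the prior part of $\log L$):

```latex
% Directional derivative of \log L along v = e_k - e_K,
% a direction that preserves the constraint \sum_j p_j = 1:
\[
  D_{e_k - e_K}\,\log L(p)
  = \frac{d}{dt}\Big|_{t=0} \sum_{j=1}^{K} N_j \log\!\big(p_j + t\,(e_k - e_K)_j\big)
  = \frac{N_k}{p_k} - \frac{N_K}{p_K}.
\]
% Setting this to zero for k = 1, ..., K-1 recovers N_k/p_k = N_K/p_K.
```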