I am having difficulty seeing how the authors (in Appendix A.3.2 under "Variational Dirichlet" of this paper) maximise the function $L$ with respect to $\gamma_i$ to derive a solution for $\gamma_i$.
$L_{[\gamma]}$ refers to only those parts of the function $L$ which contain $\gamma_i$:
$$L_{[\gamma]} = \sum^k_{i=1} (\Psi(\gamma_i) - \Psi(\textstyle \sum^k_{j=1} \gamma_j))(\alpha_i + \sum^N_{n=1} \phi_{ni} - \gamma_i) - \log \Gamma(\textstyle \sum^k_{j=1} \gamma_j) + \sum^k_{i=1} \log \Gamma(\gamma_i)$$
Where $\Psi(\cdot)$ is the digamma function and $\Gamma(\cdot)$ is the gamma function. They evaluate partial derivatives with respect to $\gamma_i$ to get the following:
$$\frac{\partial L}{\partial \gamma_i} = \Psi'(\gamma_i) (\alpha_i + \textstyle \sum^N_{n=1} \phi_{ni} - \gamma_i) - \Psi'(\textstyle \sum^k_{j=1} \gamma_j) \displaystyle \sum^k_{j=1} (\alpha_j + \textstyle \sum^N_{n=1} \phi_{nj} - \gamma_j) $$
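(As a sanity check of my own, not something from the paper, I verified this derivative numerically. The sketch below uses only the Python standard library, approximating $\Psi$ and $\Psi'$ by finite differences of $\log\Gamma$ via `math.lgamma`; the values chosen for $\gamma_i$ and for $\alpha_i + \sum_n \phi_{ni}$ are arbitrary placeholders.)

```python
import math

def psi(x, h=1e-6):
    # digamma Psi(x), approximated by a central difference of lgamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def psi1(x, h=1e-4):
    # trigamma Psi'(x), approximated by a second central difference of lgamma
    return (math.lgamma(x + h) - 2 * math.lgamma(x) + math.lgamma(x - h)) / h ** 2

def L_gamma(gamma, c):
    # L_[gamma] from above, writing c[i] for alpha_i + sum_n phi_ni
    s = sum(gamma)
    return (sum((psi(g) - psi(s)) * (ci - g) for g, ci in zip(gamma, c))
            - math.lgamma(s) + sum(math.lgamma(g) for g in gamma))

def dL_analytic(gamma, c, i):
    # the stated partial derivative dL/dgamma_i
    s = sum(gamma)
    return (psi1(gamma[i]) * (c[i] - gamma[i])
            - psi1(s) * sum(cj - g for cj, g in zip(c, gamma)))

gamma = [1.3, 2.7, 0.9]   # arbitrary test point (hypothetical values)
c = [2.0, 1.5, 3.0]       # hypothetical values of alpha_i + sum_n phi_ni

h = 1e-5
pairs = []
for i in range(len(gamma)):
    gp = list(gamma); gp[i] += h
    gm = list(gamma); gm[i] -= h
    numeric = (L_gamma(gp, c) - L_gamma(gm, c)) / (2 * h)
    pairs.append((dL_analytic(gamma, c, i), numeric))
    print(f"i={i}: analytic={pairs[-1][0]:.6f}, numeric={pairs[-1][1]:.6f}")
```

The analytic and numeric columns agree to within the finite-difference error, which is what convinced me the derivative itself is correct.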
Where they have used the result that the derivative of $\log \Gamma(x)$ with respect to $x$ is $\Psi(x)$, and where $\Psi'(\cdot)$ is the first derivative of the digamma function (the trigamma function). I have checked the above derivation and am satisfied with it.
However, I am now struggling to see how the authors go from the line above to the maximum they state next.
They then set this equal to 0, yielding a maximum with respect to $\gamma_i$:
$$\gamma_i = \alpha_i + \sum^N_{n=1} \phi_{ni}$$
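(I have at least confirmed numerically that this update is a stationary point: plugging $\gamma_i = \alpha_i + \sum_n \phi_{ni}$ for all $i$ into $L_{[\gamma]}$ and taking finite differences gives a gradient of numerically zero. The sketch below is stdlib-only, with $\Psi$ approximated via `math.lgamma`, and made-up values for $\alpha_i$ and $\sum_n \phi_{ni}$.)

```python
import math

def psi(x, h=1e-6):
    # digamma Psi(x), approximated by a central difference of lgamma
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def L_gamma(gamma, c):
    # gamma-dependent part of the bound, writing c[i] for alpha_i + sum_n phi_ni
    s = sum(gamma)
    return (sum((psi(g) - psi(s)) * (ci - g) for g, ci in zip(gamma, c))
            - math.lgamma(s) + sum(math.lgamma(g) for g in gamma))

alpha = [0.5, 1.0, 2.0]       # hypothetical model parameters
phi_sum = [3.2, 1.1, 0.7]     # hypothetical values of sum_n phi_ni
c = [a + p for a, p in zip(alpha, phi_sum)]
gamma_star = c[:]             # the claimed maximiser: gamma_i = alpha_i + sum_n phi_ni

h = 1e-5
grads = []
for i in range(len(gamma_star)):
    gp = list(gamma_star); gp[i] += h
    gm = list(gamma_star); gm[i] -= h
    grads.append((L_gamma(gp, c) - L_gamma(gm, c)) / (2 * h))

print(grads)  # every partial derivative is numerically ~0 at gamma_star
```

So the claimed solution does zero the gradient; my difficulty is purely with the algebraic argument that gets there.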
My attempt (updated to correct an algebraic error).
The paper is well known and frequently cited in the machine-learning community, so the issue is almost certainly something I'm missing rather than a typo or error in the paper.
Defining an index set $S = \{1, ..., k\}$, I separated the $i$th summand in the right-most summation from the rest of the summands to yield:
$$\begin{align} \frac{\partial L}{\partial \gamma_i} = & \space \Psi'(\gamma_i) (\alpha_i + \textstyle \sum^N_{n=1} \phi_{ni} - \gamma_i) - \Psi'(\textstyle \sum^k_{j=1} \gamma_j) \displaystyle \sum^k_{j=1} (\alpha_j + \textstyle \sum^N_{n=1} \phi_{nj} - \gamma_j) \\ = & \space\Psi'(\gamma_i) (\alpha_i + \textstyle \sum^N_{n=1} \phi_{ni} - \gamma_i) - \Psi'(\textstyle \sum^k_{j=1} \gamma_j) (\alpha_i + \textstyle \sum^N_{n=1} \phi_{ni} - \gamma_i) \\ &- \Psi'(\textstyle \sum^k_{j=1} \gamma_j) \displaystyle \sum_{j \in S \backslash \{i\}} (\alpha_j + \textstyle \sum^N_{n=1} \phi_{nj} - \gamma_j) \\ = & \space (\alpha_i + \textstyle \sum^N_{n=1} \phi_{ni} - \gamma_i) \left( \Psi'(\gamma_i) - \Psi'(\textstyle \sum^k_{j=1} \gamma_j) \right) \\ &- \Psi'(\textstyle \sum^k_{j=1} \gamma_j) \displaystyle \sum_{j \in S \backslash \{i\}} (\alpha_j + \textstyle \sum^N_{n=1} \phi_{nj} - \gamma_j) \\ = & \space 0 \end{align}$$
Now, setting this derivative equal to 0, the argument the authors use to arrive at the maximum remains opaque to me.
Some assistance would be greatly appreciated.
Further context.
As context, the function $L$ is, in the language of the machine-learning and statistics community, an evidence lower bound (i.e. a bound on a marginal log-likelihood which is analytically intractable to compute), and its arguments consist of model parameters and variational parameters.
My query concerns maximising $L$ with respect to the variational parameter $\gamma_i$. The $\gamma_1, ... , \gamma_k$ are simply (variational) parameters of the Dirichlet distribution. Similarly, the $\alpha_1, ..., \alpha_k$ are (model) parameters of the Dirichlet distribution. The $\phi_{nj}$ are probabilities, so that $\sum^k_{j=1} \phi_{nj} = 1$ (they correspond to the probability of the $n$th word in a document being drawn from the $j$th latent topic).
The maximum derived is part of a two-stage iterative EM-like procedure (in the sense of Dempster et al.) where we maximise $L$ with respect to the variational parameters $\gamma_i$ with model parameters $\alpha_i$ held fixed, followed by maximising $L$ with respect to model parameters $\alpha_i$, with the variational parameters $\gamma_i$ held fixed.
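For concreteness, that alternation can be sketched as follows. (This is a minimal sketch of my understanding, not the paper's implementation: the $\alpha$ update is left as a stub, since it requires its own optimisation, and the $\phi$ values are made-up placeholders.)

```python
def update_gamma(alpha, phi):
    # the closed-form update from the question: gamma_i = alpha_i + sum_n phi_ni
    k = len(alpha)
    return [alpha[i] + sum(phi_n[i] for phi_n in phi) for i in range(k)]

def update_alpha(alpha, gamma):
    # placeholder: the real alpha update needs its own optimisation
    # (e.g. Newton steps), which is outside the scope of this question
    return alpha

alpha = [0.5, 1.0, 2.0]            # hypothetical model parameters (k = 3)
phi = [[0.7, 0.2, 0.1],            # hypothetical phi_{n i}; each row sums to 1
       [0.1, 0.6, 0.3]]            # (N = 2 words)

for _ in range(3):
    gamma = update_gamma(alpha, phi)    # maximise L over gamma, alpha held fixed
    alpha = update_alpha(alpha, gamma)  # maximise L over alpha, gamma held fixed

print(gamma)
```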