I am trying to derive negative log likelihood of Gaussian Naive Bayes classifier and the derivatives of the parameters.
So there are class labels $y \in \{1, ..., k\}$, and a real-valued vector of $d$ features $\textbf{x} = (x_1, ..., x_d)$.
And the dataset $D = \{(y^1, \textbf{x}^1), ..., (y^N, \textbf{x}^N)\}$.
The parameters are $\theta =\{\boldsymbol\alpha, \boldsymbol\mu, \boldsymbol\sigma\}$.
$\boldsymbol\alpha = (\alpha_1, ..., \alpha_k)^T$, where $p(y=j) = \alpha_j$ for $j = 1, ..., k$.
$\boldsymbol\mu = \begin{pmatrix} \mu_{11} & ... & \mu_{1d}\\ \vdots & & \vdots\\ \mu_{k1} & ... & \mu_{kd} \end{pmatrix}$, where the $(j, i)$ entry $\mu_{ji}$ represents the mean of the $i$-th feature for class $j$.
$\boldsymbol\sigma = (\sigma_{1}^2, ..., \sigma_{d}^2)$, which holds the per-feature variances, shared across all classes.
For negative log likelihood, this is what I've got so far.
First, let $\Sigma$ be the diagonal matrix with diagonal entries $\sigma_1^2, ..., \sigma_d^2$, and let $\boldsymbol\mu_{y^m}$ denote the row of $\boldsymbol\mu$ for class $y^m$.
$$L(\theta; D) = -\sum_{m=1}^{N} \log p(y^m, \textbf{x}^m \mid \theta) = -\sum_{m=1}^{N} \log p(y^m \mid \theta) - \sum_{m=1}^{N} \log p(\textbf{x}^m \mid y^m, \theta)$$
$$= -\sum_{m=1}^{N} \log \alpha_{y^m} + \frac{N}{2} \log\!\left((2\pi)^d \, |\Sigma|\right) + \sum_{m=1}^{N} \frac{(\textbf{x}^m - \boldsymbol\mu_{y^m})\, \Sigma^{-1} (\textbf{x}^m - \boldsymbol\mu_{y^m})^T}{2}$$
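To sanity-check this expression numerically, I compared it against a sum of univariate Gaussian log-densities (which is what a diagonal $\Sigma$ reduces to). This is a rough sketch with made-up data, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, d, k = 50, 3, 4
X = rng.normal(size=(N, d))
y = rng.integers(0, k, size=N)          # 0-indexed class labels

alpha = np.full(k, 1.0 / k)             # class priors alpha_j
mu = rng.normal(size=(k, d))            # per-class, per-feature means
var = rng.uniform(0.5, 2.0, size=d)     # shared per-feature variances sigma_i^2

# NLL from the closed form above (diagonal Sigma, so |Sigma| = prod var)
quad = ((X - mu[y]) ** 2 / var).sum()
nll = (-np.log(alpha[y]).sum()
       + 0.5 * N * d * np.log(2 * np.pi)
       + 0.5 * N * np.log(var).sum()
       + 0.5 * quad)

# Reference: -log alpha minus summed univariate Gaussian log-densities
ref = (-np.log(alpha[y]).sum()
       - norm.logpdf(X, loc=mu[y], scale=np.sqrt(var)).sum())
assert np.isclose(nll, ref)
```

The two quantities agree, which at least confirms the algebra above is internally consistent.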
First off, I am not quite sure if the NLL I got is correct in the first place.
Secondly, I have no idea how I would compute $\frac{\partial L}{\partial \boldsymbol\mu}$, $\frac{\partial L}{\partial \boldsymbol\sigma}$, and $\frac{\partial L}{\partial \sigma_i^2}$.
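In the meantime, my plan is to verify any candidate derivative against central finite differences. A rough sketch of the checker (`num_grad` is my own helper name, not from any library):

```python
import numpy as np

def num_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        e = np.zeros_like(theta, dtype=float)
        e.flat[i] = eps
        g.flat[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

# quick self-test on f(t) = sum(t^2), whose gradient is 2t
t0 = np.array([1.0, -2.0, 3.0])
assert np.allclose(num_grad(lambda t: (t ** 2).sum(), t0), 2 * t0)
```

The idea would be to wrap the NLL as a function of $\boldsymbol\mu$ (or $\boldsymbol\sigma$) alone and compare the analytic gradient entry by entry.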
Thanks for the help in advance!
The naive Bayes classifier determines the class (i.e., the $k$) that maximizes the posterior probability $$ \mathbb{P}(y=k|\mathbf{x}) = \frac{\mathbb{P}(\mathbf{x}|y=k)\mathbb{P}(y=k)}{\mathbb{P}(\mathbf{x})} $$ As you can observe, only the numerator matters from this perspective.
The likelihood $\mathbb{P}(\mathbf{x}|y=k)$ is computed by evaluating $\mathcal{N}(\mathbf{x};\mu_k,\mathbf{\Sigma}_k)$ where the covariance is assumed to be diagonal in the naive case.
Because NB is a supervised classification method, you simply need to isolate the examples from each class and estimate the class-wise parameters in the standard way.
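A minimal NumPy sketch of that estimation step, plus prediction via the numerator above (function names are my own, classes are 0-indexed, and every class is assumed to appear in `y`):

```python
import numpy as np

def fit_gnb(X, y, k):
    """MLE for the shared-variance Gaussian NB parameters.

    alpha_j = class frequency, mu[j] = mean of the class-j rows,
    var_i   = variance of feature i around its class mean (pooled over classes).
    """
    N = X.shape[0]
    alpha = np.bincount(y, minlength=k) / N               # p(y = j)
    mu = np.stack([X[y == j].mean(axis=0) for j in range(k)])
    var = ((X - mu[y]) ** 2).mean(axis=0)                 # shared sigma_i^2
    return alpha, mu, var

def predict(X, alpha, mu, var):
    """argmax_j of log p(y=j) + log N(x; mu_j, diag(var)), i.e. the numerator."""
    ll = (-0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2 / var).sum(axis=-1)
          - 0.5 * np.log(2 * np.pi * var).sum())          # shape (N, k)
    return np.argmax(np.log(alpha) + ll, axis=1)
```

Note that $\mathbb{P}(\mathbf{x})$ is never computed: the argmax over the log-numerator is enough for classification.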