Negative Log-Likelihood and Derivatives of Gaussian Naive Bayes


I am trying to derive the negative log-likelihood of a Gaussian Naive Bayes classifier and its derivatives with respect to the parameters.

So there are class labels $y \in \{1, \dots, k\}$ and a real-valued vector of $d$ features $\textbf{x} = (x_1, \dots, x_d)$.

And the dataset $D = \{(y^1, \textbf{x}^1), ..., (y^N, \textbf{x}^N)\}$.

The parameters are $\theta =\{\boldsymbol\alpha, \boldsymbol\mu, \boldsymbol\sigma\}$.

$\boldsymbol\alpha = (\alpha_1, \dots, \alpha_k)^T$, where $p(y=j) = \alpha_j$.

$\boldsymbol\mu = \begin{pmatrix} \mu_{11} & \dots & \mu_{1d}\\ \vdots & & \vdots\\ \mu_{k1} & \dots & \mu_{kd} \end{pmatrix}$ where the $(j, i)$ entry $\mu_{ji}$ is the mean of the $i$th feature for class label $j$.

$\boldsymbol\sigma = (\sigma_{1}^2, \dots, \sigma_{d}^2)$, which holds the per-feature variances, shared across classes.


For negative log likelihood, this is what I've got so far.

First let $\Sigma$ be the diagonal matrix with $\boldsymbol\sigma$ on its diagonal, and let $\boldsymbol\mu_k$ be the $k$th row of $\boldsymbol\mu$.

$L(\theta; D) = -\sum_{m=1}^{N} \log p(y^m, \textbf{x}^m \mid \theta) = -\sum_{m=1}^{N} \log p(y^m \mid \theta) - \sum_{m=1}^{N} \log p(\textbf{x}^m \mid y^m, \theta) \\ = -\sum_{m=1}^{N} \log \alpha_{y^m} + \frac{N}{2} \log\big((2\pi)^d \, |\Sigma|\big) + \sum_{m=1}^{N} \frac{(\textbf{x}^m - \boldsymbol\mu_{y^m}) \, \Sigma^{-1} \, (\textbf{x}^m - \boldsymbol\mu_{y^m})^T}{2}$
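As a sanity check on the algebra, the joint NLL $-\sum_m \big[\log \alpha_{y^m} + \log \mathcal{N}(\textbf{x}^m; \boldsymbol\mu_{y^m}, \Sigma)\big]$ can be verified numerically against SciPy's Gaussian log-pdf. A minimal sketch with made-up data and parameters (all names and values below are hypothetical):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, d, k = 20, 3, 4

# toy data and parameters (hypothetical)
X = rng.normal(size=(N, d))                 # rows are the x^m
y = rng.integers(0, k, size=N)              # class labels y^m
alpha = np.full(k, 1.0 / k)                 # class priors
mu = rng.normal(size=(k, d))                # class means (k x d)
sigma2 = rng.uniform(0.5, 2.0, size=d)      # shared per-feature variances
Sigma = np.diag(sigma2)

# NLL with Sigma = diag(sigma2):
# -sum_m log alpha_{y^m} + (N/2) log((2 pi)^d |Sigma|)
# + sum_m (x^m - mu_{y^m}) Sigma^{-1} (x^m - mu_{y^m})^T / 2
diff = X - mu[y]                            # (N, d) residuals
quad = np.sum(diff**2 / sigma2, axis=1)     # Mahalanobis terms
nll = (-np.sum(np.log(alpha[y]))
       + 0.5 * N * (d * np.log(2 * np.pi) + np.log(sigma2).sum())
       + 0.5 * quad.sum())

# cross-check against scipy's multivariate normal log-pdf
nll_ref = -np.sum(np.log(alpha[y])) - sum(
    multivariate_normal.logpdf(X[m], mean=mu[y[m]], cov=Sigma)
    for m in range(N)
)
print(np.allclose(nll, nll_ref))  # True
```

If the two values agree, the normalizing constant is entering additively (once per example) as it should, rather than multiplying the quadratic term.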

First off, I am not quite sure if the NLL I got is correct in the first place.

Secondly, I have no idea how I would compute $\frac{\partial L}{\partial \boldsymbol\mu}$, $\frac{\partial L}{\partial \boldsymbol\sigma}$, and $\frac{\partial L}{\partial \sigma_i^2}$.

Thanks for the help in advance!


The naive Bayes classifier determines the class (i.e. the $k$) that maximizes the posterior probability $$ \mathbb{P}(y=k|\mathbf{x}) = \frac{\mathbb{P}(\mathbf{x}|y=k)\mathbb{P}(y=k)}{\mathbb{P}(\mathbf{x})} $$ As you can observe, only the numerator matters in this perspective.

The likelihood $\mathbb{P}(\mathbf{x}|y=k)$ is computed by evaluating $\mathcal{N}(\mathbf{x};\mu_k,\mathbf{\Sigma}_k)$ where the covariance is assumed to be diagonal in the naive case.
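In code, this amounts to evaluating $\log \mathbb{P}(\mathbf{x}|y=k) + \log \mathbb{P}(y=k)$ for every class and taking the argmax; the denominator $\mathbb{P}(\mathbf{x})$ is the same for all classes and can be dropped. A minimal sketch with a diagonal (naive) covariance shared across classes, matching the question's setup (function name and parameter values are hypothetical):

```python
import numpy as np

def predict(x, alpha, mu, sigma2):
    """Return argmax_k of log P(x | y=k) + log P(y=k)."""
    # log N(x; mu_k, diag(sigma2)) for all k at once via broadcasting
    log_lik = -0.5 * (np.log(2 * np.pi * sigma2).sum()
                      + np.sum((x - mu) ** 2 / sigma2, axis=1))
    return np.argmax(np.log(alpha) + log_lik)  # P(x) ignored: same for all k

# two well-separated classes in 2D (toy parameters)
alpha = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
sigma2 = np.array([1.0, 1.0])
print(predict(np.array([4.5, 5.2]), alpha, mu, sigma2))  # 1
```

Working in log space avoids underflow when $d$ is large, which is why the product of per-feature densities is never formed explicitly.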

Because naive Bayes is a supervised classifier, you simply need to isolate the examples of each class and estimate the class-wise parameters in the standard way: relative frequencies for the priors, and sample means and variances for the Gaussians.
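For the shared-variance model in the question, "the standard way" gives closed-form maximum-likelihood estimates (the same ones obtained by setting the gradients of the NLL to zero): class frequencies for $\boldsymbol\alpha$, class-wise sample means for $\boldsymbol\mu$, and residual variances pooled over all classes for $\boldsymbol\sigma$. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def fit_gnb_shared(X, y, k):
    """MLE: class priors alpha, per-class means mu (k x d),
    and one variance per feature shared across classes."""
    N, d = X.shape
    alpha = np.bincount(y, minlength=k) / N              # alpha_j = N_j / N
    mu = np.vstack([X[y == j].mean(axis=0) for j in range(k)])
    # sigma_i^2 = (1/N) sum_m (x_i^m - mu_{y^m, i})^2    (pooled residuals)
    sigma2 = ((X - mu[y]) ** 2).mean(axis=0)
    return alpha, mu, sigma2

# quick check on synthetic data with well-separated classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(5.0, 1.0, size=(500, 2))])
y = np.repeat([0, 1], 500)
alpha, mu, sigma2 = fit_gnb_shared(X, y, k=2)
print(alpha)  # [0.5 0.5]; mu ~ [[0, 0], [5, 5]]; sigma2 ~ [1, 1]
```

Note that this pools the variance across classes, per the question's $\boldsymbol\sigma = (\sigma_1^2, \dots, \sigma_d^2)$; the more common formulation (e.g. scikit-learn's `GaussianNB`) estimates a separate variance per class.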