How to do the derivation of the MLE for Linear Discriminant Analysis


$$ \ell(\phi, \mu, \Sigma) = \log \prod_{i=1}^{M} p(x^{(i)}, y^{(i)}; \phi, \mu, \Sigma) $$

$$ = \log \prod_{i=1}^{M} p(x^{(i)}|y^{(i)}; \mu, \Sigma) p(y^{(i)}; \phi) $$

$$ = \log \prod_{i=1}^{M} \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_{y^{(i)}})^T \Sigma^{-1} (x^{(i)} - \mu_{y^{(i)}})\right) \prod_{c=1}^{C} \phi_c^{I[y^{(i)}=c]} $$

$$ = \sum_{i=1}^{M} \left[-\frac{N}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2}(x^{(i)} - \mu_{y^{(i)}})^T \Sigma^{-1} (x^{(i)} - \mu_{y^{(i)}}) + \sum_{c=1}^{C} I[y^{(i)} = c] \log \phi_c\right]. $$
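To make the final log-likelihood expression concrete, here is a minimal numerical sketch of it (the function and variable names are my own, not from the post):

```python
import numpy as np

def lda_log_likelihood(X, y, phi, mus, Sigma):
    """Joint log-likelihood sum_i [log N(x^(i); mu_{y^(i)}, Sigma) + log phi_{y^(i)}].

    X: (M, N) data, y: (M,) integer class labels in {0, ..., C-1},
    phi: (C,) class priors, mus: (C, N) class means, Sigma: (N, N) shared covariance.
    """
    M, N = X.shape
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)          # log|Sigma|, numerically stable
    diffs = X - mus[y]                            # rows are x^(i) - mu_{y^(i)}
    # Quadratic forms (x^(i) - mu)^T Sigma^{-1} (x^(i) - mu), one per sample
    quad = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)
    return np.sum(-0.5 * N * np.log(2 * np.pi) - 0.5 * logdet
                  - 0.5 * quad + np.log(phi[y]))
```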

Now we need to take partial derivatives with respect to each parameter and equate it to zero. For $\mu_c$,

$$ \frac{\partial\ell(\phi, \mu_c, \Sigma)}{\partial \mu_c} = \sum_{i=1}^{M} I[y^{(i)} = c] \Sigma^{-1}(x^{(i)} - \mu_c) = 0. $$
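(For reference, this last step is not shown in the post, but setting the gradient to zero and solving gives the familiar class-mean estimator; a sketch:)

```latex
% Left-multiply by \Sigma (invertible), then solve for \mu_c:
\sum_{i=1}^{M} I[y^{(i)} = c]\,(x^{(i)} - \mu_c) = 0
\;\Longrightarrow\;
\hat{\mu}_c = \frac{\sum_{i=1}^{M} I[y^{(i)} = c]\, x^{(i)}}{\sum_{i=1}^{M} I[y^{(i)} = c]}
```

i.e. the sample mean of the $x^{(i)}$ belonging to class $c$.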

My question

My question is that I don't know how to carry out this kind of vector derivative, i.e., how to go from this step: $$ = \sum_{i=1}^{M} \left[-\frac{N}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2}(x^{(i)} - \mu_{y^{(i)}})^T \Sigma^{-1} (x^{(i)} - \mu_{y^{(i)}}) + \sum_{c=1}^{C} I[y^{(i)} = c] \log \phi_c\right] $$ to this step: $$ \frac{\partial\ell(\phi, \mu_c, \Sigma)}{\partial \mu_c} = \sum_{i=1}^{M} I[y^{(i)} = c] \Sigma^{-1}(x^{(i)} - \mu_c) = 0. $$

I know basic linear algebra and calculus, but I have not encountered this kind of derivation before and don't know where to learn it. I have been stuck here for a long time. Could someone provide a step-by-step proof of going from the first step to the second?

Best answer

The notation is a bit confusing, since the log-likelihood is written in terms of the collection of mean vectors $\mu = (\mu_1, \dots, \mu_C)$, but the derivative is taken with respect to a single one of them, $\mu_c$. Recall that since we have multiple classes, there is a separate mean for each class $c$.

In any case, the way to proceed is to remember that we want the derivative of our function with respect to the variable $\mu_c$ and proceed as usual:

$$\frac{\partial}{\partial \mu_c}\left ( \sum_{i=1}^{M} \left[-\frac{N}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2}(x^{(i)} - \mu_{y^{(i)}})^T \Sigma^{-1} (x^{(i)} - \mu_{y^{(i)}}) + \sum_{c=1}^{C} I[y^{(i)} = c] \log \phi_c\right]\right ) = \frac{\partial}{\partial \mu_c}\left ( \sum_{i=1}^{M}- \frac{1}{2}(x^{(i)} - \mu_{y^{(i)}})^T \Sigma^{-1} (x^{(i)} - \mu_{y^{(i)}})\right )$$

where the equality follows since the other terms are constant with respect to $\mu_c$. Now since the derivative is a linear operator, we can take the derivative of each term in the sum separately. Further, each term will be constant with respect to $\mu_c$ unless $\mu_{y^{(i)}} = \mu_c$, so:

$$=- \frac{1}{2}\sum_{i=1}^{M}\frac{\partial}{\partial \mu_c}(x^{(i)} - \mu_{y^{(i)}})^T \Sigma^{-1} (x^{(i)} - \mu_{y^{(i)}}) = \sum_{i=1}^{M}I[y^{(i)} = c]\, \Sigma^{-1} (x^{(i)} - \mu_c)$$

where the last step follows by the chain rule (the inner derivative $\partial(x^{(i)} - \mu_c)/\partial \mu_c = -I$ contributes a factor of $-1$, which combines with the leading $-\frac{1}{2}$ and the factor of $2$ below to give $+1$) and the fact that for symmetric $A$:

$$\frac{\partial}{\partial x} x^TAx = 2Ax$$
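This identity is easy to sanity-check numerically with central finite differences (a quick illustration, not part of the original answer):

```python
import numpy as np

# Check d/dx (x^T A x) = 2 A x for a random symmetric A.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                      # symmetrize
x = rng.normal(size=4)

f = lambda v: v @ A @ v                # the quadratic form x^T A x
eps = 1e-6
# Central difference in each coordinate direction e
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(grad_fd, 2 * A @ x, atol=1e-5))   # True
```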

Let me know which of these steps are confusing for you and I can clarify.

Finally, see chapter 2 of the Matrix Cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf (specifically equation 81 in the case where $B$ is symmetric)
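Putting the pieces together, the closed-form MLEs can be sketched as below. The post only derives the condition for $\mu_c$; the estimates for $\phi$ and $\Sigma$ (class frequencies and the pooled covariance of the class-centered data) follow from the same log-likelihood by analogous derivatives, so treat those two lines as an assumption here rather than something derived above:

```python
import numpy as np

def fit_lda_mle(X, y, C):
    """Closed-form LDA MLEs: class priors, class means, shared covariance."""
    M, _ = X.shape
    phi = np.bincount(y, minlength=C) / M                       # class frequencies
    mus = np.array([X[y == c].mean(axis=0) for c in range(C)])  # class means
    diffs = X - mus[y]                                          # x^(i) - mu_{y^(i)}
    Sigma = diffs.T @ diffs / M                                 # pooled covariance (MLE divides by M)
    return phi, mus, Sigma
```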