Derivation of normal equations for maximum likelihood and least squares


The question stems from the text Pattern Recognition and Machine Learning by Christopher Bishop, Chapter 3.1.1. To say my maths is rusty is an understatement, so I would appreciate any nudge in the right direction on the points below.

The log-likelihood of the linear regression model with Gaussian noise is given below:

$$(1)\quad \frac{N}{2} \ln \beta - \frac{N}{2}\ln 2 \pi - \frac{\beta}{2}\sum_{n=1}^N\{t_n - \mathbf w^T \mathbf \phi(\mathbf x_n) \}^2$$

Taking the gradient with respect to $\mathbf w$ and setting it equal to zero, we get:

$$(2)\quad0 = \sum_{n=1}^N\{t_n - \mathbf w^T \mathbf \phi(\mathbf x_n) \}\phi(\mathbf x_n)^T $$

$$(3)\quad0 = \sum_{n=1}^Nt_n \phi(\mathbf x_n)^T - \mathbf w^T(\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T)$$

Solving for $\mathbf w$ we get:

$$(4)\quad\mathbf w_{ML}=(\mathbf \Phi^T \mathbf \Phi)^{-1} \mathbf \Phi^T \mathbf {\mathtt t}$$

where

$$(5)\quad\mathbf \Phi = \phi_j(\mathbf x_n)$$
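As a numerical sketch of (4) (the toy data and polynomial basis here are illustrative, not from Bishop), the normal-equations solution can be computed and cross-checked against a least-squares solver:

```python
import numpy as np

# Toy data: N = 20 noisy samples of a sine curve (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Design matrix with Phi[n, j] = phi_j(x_n); here a polynomial basis
M = 4
Phi = np.vander(x, M, increasing=True)  # columns: 1, x, x^2, x^3

# Normal equations (4): w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Cross-check against NumPy's numerically stable least-squares routine
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_ml, w_lstsq))  # the two solutions agree
```

In practice `np.linalg.lstsq` (or a QR/SVD-based solver) is preferred over forming $\mathbf \Phi^T \mathbf \Phi$ explicitly, since the normal equations square the condition number.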

Now for the questions.

(a) In (2) when using the chain rule why do we get $\phi(\mathbf x_n)^T$ instead of $\phi(\mathbf x_n)?$

(b) Is the following a valid way to get from (3) to (4)? Continuing from (3):

$$\mathbf w^T(\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T) = \sum_{n=1}^Nt_n \phi(\mathbf x_n)^T$$

$$[\mathbf w^T(\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T)]^T = [\sum_{n=1}^Nt_n \phi(\mathbf x_n)^T]^T$$

$$\mathbf w_{ML}(\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T) = \sum_{n=1}^N \phi(\mathbf x_n) t_n $$

Where $\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T$ = $(\mathbf \Phi^T \mathbf \Phi)$ and $\sum_{n=1}^N \phi(\mathbf x_n) t_n = \mathbf \Phi^T \mathbf {\mathtt t}$

$$\mathbf w_{ML}(\mathbf \Phi^T \mathbf \Phi)(\mathbf \Phi^T \mathbf \Phi)^{-1}=(\mathbf \Phi^T \mathbf \Phi)^{-1} \mathbf \Phi^T \mathbf {\mathtt t}$$

$$(4)\quad\mathbf w_{ML}=(\mathbf \Phi^T \mathbf \Phi)^{-1} \mathbf \Phi^T \mathbf {\mathtt t}$$

If it is not valid, can you please point out where?
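The two sum-to-matrix identities used above can be checked numerically (random data standing in for an arbitrary design matrix; this is just a sanity check, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 3
Phi = rng.standard_normal((N, M))  # Phi[n, j] = phi_j(x_n)
t = rng.standard_normal(N)

# sum_n phi(x_n) phi(x_n)^T as an explicit loop over rows of Phi
S = sum(np.outer(Phi[n], Phi[n]) for n in range(N))
# sum_n phi(x_n) t_n
b = sum(Phi[n] * t[n] for n in range(N))

print(np.allclose(S, Phi.T @ Phi))  # matches Phi^T Phi
print(np.allclose(b, Phi.T @ t))    # matches Phi^T t
```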

On BEST ANSWER

(a) No reason, really. Expressing the gradient as a row vector sets up for what is done in step (3), but you could equally make it proportional to the column vector $\phi(x_n)$: write $(t_n-w^T\phi(x_n))^2 = (t_n-\phi(x_n)^Tw)^2$, differentiate, and multiply the resulting scalar by $\phi(x_n)$ on the left.

(b) Yes, but in the third line $w_{ML}$ should be on the right of $\Phi^T\Phi$: transposing gives $[w^T(\Phi^T\Phi)]^T = (\Phi^T\Phi)^T w = (\Phi^T\Phi)\,w$, since $\Phi^T\Phi$ is symmetric. Also, you should be explicit about which index is which. $\Phi = \phi_j(x_n)$ seems to suggest the second index is $n$, but what you really have in this notation is $\Phi_{nj} = \phi_j(x_n).$
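Regarding (a), either convention gives the same stationarity condition; the analytic gradient $\Phi^T(\Phi w - t)$ of the sum-of-squares term can be verified against central finite differences (toy data, $\beta$ set to 1; a sanity check I am adding, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 8, 3
Phi = rng.standard_normal((N, M))  # Phi[n, j] = phi_j(x_n)
t = rng.standard_normal(N)
w = rng.standard_normal(M)

def E(w):
    # Sum-of-squares error: the w-dependent part of the negative log-likelihood
    r = t - Phi @ w
    return 0.5 * r @ r

# Analytic gradient from (2): -sum_n (t_n - w^T phi(x_n)) phi(x_n) = Phi^T (Phi w - t)
grad = Phi.T @ (Phi @ w - t)

# Central finite differences along each coordinate direction
eps = 1e-6
num = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                for e in np.eye(M)])
print(np.allclose(grad, num, atol=1e-5))  # True
```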