This question stems from the text *Pattern Recognition and Machine Learning* by Christopher Bishop, Section 3.1.1. To say my maths is rusty is an understatement, so I would appreciate any nudge in the right direction on the points below.
The logarithm of the likelihood function for a linear regression model with Gaussian noise is given below:
$$(1)\quad \frac{N}{2} \ln \beta - \frac{N}{2}\ln 2 \pi - \frac{\beta}{2}\sum_{n=1}^N\{t_n - \mathbf w^T \mathbf \phi(\mathbf x_n) \}^2$$
Taking the gradient (derivative) with respect to $\mathbf w$ and setting it equal to zero, we get:
$$(2)\quad0 = \sum_{n=1}^N\{t_n - \mathbf w^T \mathbf \phi(\mathbf x_n) \}\phi(\mathbf x_n)^T $$
$$(3)\quad0 = \sum_{n=1}^Nt_n \phi(\mathbf x_n)^T - \mathbf w^T(\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T)$$
Solving for $\mathbf w$ we get:
$$(4)\quad\mathbf w_{ML}=(\mathbf \Phi^T \mathbf \Phi)^{-1} \mathbf \Phi^T \mathbf {\mathtt t}$$
where
$$(5)\quad\mathbf \Phi = \phi_j(\mathbf x_n)$$
Now for the questions.
(a) In (2), when using the chain rule, why do we get $\phi(\mathbf x_n)^T$ instead of $\phi(\mathbf x_n)$?
(b) Is the following a valid way to get from (3) to (4)? Continuing from (3):
$$\mathbf w^T(\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T) = \sum_{n=1}^Nt_n \phi(\mathbf x_n)^T$$
$$[\mathbf w^T(\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T)]^T = [\sum_{n=1}^Nt_n \phi(\mathbf x_n)^T]^T$$
$$\mathbf w_{ML}(\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T) = \sum_{n=1}^N \phi(\mathbf x_n) t_n $$
where $\sum_{n=1}^N \mathbf \phi(\mathbf x_n) \phi(\mathbf x_n)^T = \mathbf \Phi^T \mathbf \Phi$ and $\sum_{n=1}^N \phi(\mathbf x_n) t_n = \mathbf \Phi^T \mathbf {\mathtt t}$,
$$\mathbf w_{ML}(\mathbf \Phi^T \mathbf \Phi)(\mathbf \Phi^T \mathbf \Phi)^{-1}=(\mathbf \Phi^T \mathbf \Phi)^{-1} \mathbf \Phi^T \mathbf {\mathtt t}$$
$$(4)\quad\mathbf w_{ML}=(\mathbf \Phi^T \mathbf \Phi)^{-1} \mathbf \Phi^T \mathbf {\mathtt t}$$
If it is not valid can you please point out where.
(a) No reason, really. Expressing the gradient as a row vector sets things up for step (3), but you could just as well make the gradient proportional to the column vector $\phi(\mathbf x_n)$: write $(t_n-\mathbf w^T\phi(\mathbf x_n))^2 = (t_n-\phi(\mathbf x_n)^T\mathbf w)^2$ and then multiply this scalar by $\phi(\mathbf x_n)$ on the left.
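Either form of the gradient can be checked numerically against finite differences. A minimal sketch in NumPy (the data here are made up; `Phi` stacks the $\phi(\mathbf x_n)^T$ as rows, so `-Phi.T @ (t - Phi @ w)` is the column-vector form of the gradient in (2)):

```python
import numpy as np

# Hypothetical small problem: N points, M basis functions.
rng = np.random.default_rng(0)
N, M = 5, 3
Phi = rng.normal(size=(N, M))   # row n is phi(x_n)^T
t = rng.normal(size=N)
w = rng.normal(size=M)

def sse(w):
    # Sum-of-squares term from the log-likelihood (the beta/2 factor is dropped,
    # since it does not affect where the gradient vanishes).
    return 0.5 * np.sum((t - Phi @ w) ** 2)

# Analytic gradient: -sum_n (t_n - w^T phi(x_n)) phi(x_n), as a column vector.
grad = -Phi.T @ (t - Phi @ w)

# Central finite-difference approximation, one coordinate at a time.
eps = 1e-6
num = np.array([(sse(w + eps * e) - sse(w - eps * e)) / (2 * eps)
                for e in np.eye(M)])

print(np.allclose(grad, num, atol=1e-5))  # True
```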
(b) Yes, but in the third line $\mathbf w_{ML}$ should be on the right of $\mathbf\Phi^T\mathbf\Phi$: transposing gives $[\mathbf w^T A]^T = A^T \mathbf w = A\mathbf w$ here, since $A = \sum_{n=1}^N \phi(\mathbf x_n)\phi(\mathbf x_n)^T$ is symmetric. Also, you should probably be explicit about which index is which. $\mathbf\Phi = \phi_j(\mathbf x_n)$ seems to suggest the second index is $n$, but what you really have in this notation is $\Phi_{nj} = \phi_j(\mathbf x_n)$.
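The two identities and the closed form (4) are easy to verify numerically. A sketch with a made-up polynomial-basis example (so $\Phi_{nj} = x_n^j$; any basis would do):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 4
x = rng.uniform(-1, 1, size=N)
Phi = np.vander(x, M, increasing=True)       # Phi[n, j] = x_n ** j
t = np.sin(np.pi * x) + 0.1 * rng.normal(size=N)

# Identities used in the derivation: the Gram matrix and the projected targets.
assert np.allclose(Phi.T @ Phi, sum(np.outer(p, p) for p in Phi))
assert np.allclose(Phi.T @ t, sum(tn * p for tn, p in zip(t, Phi)))

# Closed form (4), solved as a linear system rather than via an explicit inverse.
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Agrees with a generic least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(np.allclose(w_ml, w_lstsq))  # True
```

In practice one uses `np.linalg.solve` (or `lstsq`, which is more stable for ill-conditioned $\mathbf\Phi$) rather than forming $(\mathbf\Phi^T\mathbf\Phi)^{-1}$ explicitly.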