Why does the following gradient equal the transpose of $\phi$,
$\nabla_w \bigg[w^T\phi\bigg] = \phi^T$
instead of just $\phi$?
$\nabla_w \bigg[w^T\phi\bigg] = \phi$
This comes up when minimizing the sum-of-squares error, as in the derivation below.
$$\ln p(t|w, \beta) = \ln \prod^{N}_{n=1} \Bigg[ \bigg( \frac{1}{(2\pi\sigma^2)^{1/2}}\bigg)\exp\bigg( \frac{-1}{2\sigma^2}(t_n-\mu)^2 \bigg) \Bigg]$$
$$\ln p(t|w, \beta) = \sum^{N}_{n=1} \bigg[ \frac{-1}{2}\ln(2\pi\sigma^2) + \frac{-1}{2\sigma^2}(t_n-\mu)^2 \bigg]$$
$$\ln p(t|w, \beta) = \sum^{N}_{n=1} \bigg[ \frac{-1}{2\sigma^2}(t_n-\mu)^2 + \frac{-1}{2}\ln(2\pi\sigma^2)\bigg]$$
$$\ln p(t|w, \beta) = \frac{-1}{2\sigma^2} \sum^{N}_{n=1} \bigg[ (t_n-\mu)^2 \bigg] + \frac{-N}{2}\ln(2\pi\sigma^2)$$
$$\ln p(t|w, \beta) = -\frac{1}{2\sigma^2}\sum^{N}_{n=1} \bigg[ (t_n-\mu)^2 \bigg] - \frac{N}{2}\ln(\sigma^2) - \frac{N}{2}\ln(2\pi)$$
Substituting $\beta = 1/\sigma^2$ and $\mu = w^T\phi(x_n)$:
$$\ln p(t|w, \beta) = -\beta \bigg(\frac{1}{2}\sum^{N}_{n=1} (t_n-w^T\phi(x_n))^2 \bigg) + \frac{N}{2}\ln(\beta) - \frac{N}{2}\ln(2\pi)$$
$$E(w) = \frac{1}{2}\sum^{N}_{n=1} (t_n-w^T\phi(x_n))^2$$
$$\ln p(t|w, \beta) = -\beta E(w) + \frac{N}{2}\ln(\beta) - \frac{N}{2}\ln(2\pi)$$
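As a quick numerical sanity check of the simplification above (with made-up data and basis values; `Phi`, `w`, and `beta` are all hypothetical), the compact form $-\beta E(w) + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi$ should agree with a direct sum of Gaussian log-densities:

```python
import numpy as np

# Hypothetical data: rows of Phi play the role of phi(x_n)^T.
rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.normal(size=(N, M))
w = rng.normal(size=M)
t = Phi @ w + 0.1 * rng.normal(size=N)
beta = 1 / 0.1**2                    # precision beta = 1/sigma^2

mu = Phi @ w                         # means w^T phi(x_n)
# Direct sum of Gaussian log-densities, term by term.
direct = np.sum(-0.5 * np.log(2 * np.pi / beta) - 0.5 * beta * (t - mu) ** 2)
# The simplified form derived above.
E = 0.5 * np.sum((t - mu) ** 2)      # sum-of-squares error E(w)
simplified = -beta * E + N / 2 * np.log(beta) - N / 2 * np.log(2 * np.pi)
assert np.isclose(direct, simplified)
```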
Now I want to maximize the likelihood with respect to the weight vector $w$. To do this, take the gradient with respect to $w$ and find the $w$ at which the gradient is zero; this maximizes the log-likelihood, and hence the likelihood.
$$\nabla_w \bigg( -\beta E(w) + \frac{N}{2}ln(\beta) - \frac{N}{2}\ln(2\pi) \bigg) = 0$$
$$-\beta~\nabla_w E(w) = 0 $$
$$-\frac{\beta}{2}\sum^{N}_{n=1} \nabla_w \bigg((t_n-w^T\phi(x_n))^2 \bigg)=0$$
$$\beta\sum^{N}_{n=1} \nabla_w \bigg((t_n-w^T\phi(x_n))^2 \bigg)=0$$
$$\beta\sum^{N}_{n=1} 2(t_n-w^T\phi(x_n)) \nabla_w \bigg(t_n-w^T\phi(x_n) \bigg) =0$$
(using $\nabla_w\big(t_n-w^T\phi(x_n)\big) = -\nabla_w\big(w^T\phi(x_n)\big)$ and dropping the constant factor $-2$, since the right-hand side is zero)
$$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \nabla_w \bigg(w^T\phi(x_n) \bigg) =0$$
Here's where I get confused... the book says the result should be:
$$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \ \phi(x_n)^T = 0$$
How do you know the gradient in the previous step should pick up a transpose in the result?
$$\beta\sum^{N}_{n=1} \bigg(t_n\phi(x_n)^T-w^T\phi(x_n)\phi(x_n)^T \bigg) =0$$
$$\sum^{N}_{n=1} \bigg(t_n\phi(x_n)^T-w^T\phi(x_n)\phi(x_n)^T \bigg) =0$$
$$\bigg(\sum^{N}_{n=1} t_n\phi(x_n)^T\bigg) - w^T \bigg( \sum^{N}_{n=1} \phi(x_n)\phi(x_n)^T \bigg) =0$$
I'm not sure about this step, but it should reduce to the normal equations. Writing $\Phi$ for the $N \times M$ design matrix whose $n$-th row is $\phi(x_n)^T$, transposing the previous line gives:
$$\Phi^T t = (\Phi^T \Phi) w$$
$$w_{ML}=(\Phi^T \Phi)^{-1} \Phi^T t$$
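A small numerical check (with arbitrary made-up data) that this closed-form $w_{ML}$ agrees with a generic least-squares solver:

```python
import numpy as np

# Hypothetical design matrix and targets; rows of Phi are phi(x_n)^T.
rng = np.random.default_rng(1)
N, M = 100, 5
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# Normal-equations solution (Phi^T Phi)^{-1} Phi^T t, via a linear solve.
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
# Reference: numpy's built-in least-squares solver.
w_ref = np.linalg.lstsq(Phi, t, rcond=None)[0]
assert np.allclose(w_ml, w_ref)
```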
This is an abuse of notation: strictly, the expression is clearer if we work componentwise,
taking the partial derivative with respect to each component $w_i$ (index/tensor notation).
change:
$\nabla_w \bigg[w^T\phi\bigg]$
to:
$\frac{\partial}{\partial w_i} \bigg[w^T\phi\bigg]$
Then we can convert the inner product to index notation (summation over the repeated index $j$ is implied):
$\frac{\partial}{\partial w_i} \bigg[w_j \phi_j \bigg] = \frac{\partial}{\partial w_i} \bigg[w_j \bigg]\phi_j = \delta_{ij} \phi_j = \phi_i$
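As a quick sanity check of this identity (with arbitrary made-up vectors), a central finite difference reproduces $\partial(w^T\phi)/\partial w_i = \phi_i$:

```python
import numpy as np

# Arbitrary vectors standing in for w and phi.
rng = np.random.default_rng(2)
M = 6
w, phi = rng.normal(size=M), rng.normal(size=M)

eps = 1e-6
grad = np.empty(M)
for i in range(M):
    dw = np.zeros(M)
    dw[i] = eps
    # Central difference of the scalar w^T phi in the i-th coordinate.
    grad[i] = ((w + dw) @ phi - (w - dw) @ phi) / (2 * eps)
assert np.allclose(grad, phi)
```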
change:
$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \nabla_w \bigg(w^T\phi(x_n) \bigg) =0$
to:
$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \frac{\partial}{\partial w_i} \bigg(w^T\phi(x_n) \bigg) =0$
$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi_i(x_n) =0$
So now I have $M$ equations, one for each $i = 1, \dots, M$ (where $M$ is the number of basis functions):
$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi_1(x_n) =0$
$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi_2(x_n) =0$
$...$
$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi_M(x_n) =0$
Converting the component $\phi_i$ back to vector form: the $M$ scalar equations are the components of one vector equation, and under the convention that the gradient of a scalar with respect to a column vector $w$ is a row vector, they stack into the row vector $\phi^T$:
$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi^T(x_n) =0$
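Finally, a numerical check (toy made-up data; $\beta$ drops out of the equation) that the $M$ stacked equations are exactly the components of $\Phi^T(t - \Phi w)$, which vanish at $w_{ML}$:

```python
import numpy as np

# Hypothetical data: rows of Phi are phi(x_n)^T.
rng = np.random.default_rng(3)
N, M = 40, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# Maximum-likelihood weights from the normal equations.
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
# One entry per basis function: sum_n (t_n - w^T phi(x_n)) phi_i(x_n).
residual_eqs = Phi.T @ (t - Phi @ w_ml)
assert np.allclose(residual_eqs, 0, atol=1e-9)
```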