Proof: maximum likelihood and least squares give the pseudo-inverse


Why does the following gradient equal the transpose of $\phi$,

$\nabla_w \bigg[w^T\phi\bigg] = \phi^T$

instead of just $\phi$?

$\nabla_w \bigg[w^T\phi\bigg] = \phi$


This comes up when minimizing the sum-of-squares error, as in the following derivation.

$$\ln p(t|w, \beta) = \ln \prod^{N}_{n=1} \Bigg[ \bigg( \frac{1}{(2\pi\sigma^2)^{1/2}}\bigg)\exp\bigg( \frac{-1}{2\sigma^2}(t_n-\mu)^2 \bigg) \Bigg]$$

$$\ln p(t|w, \beta) = \sum^{N}_{n=1} \bigg[ \frac{-1}{2}\ln(2\pi\sigma^2) + \frac{-1}{2\sigma^2}(t_n-\mu)^2 \bigg]$$

$$\ln p(t|w, \beta) = \sum^{N}_{n=1} \bigg[ \frac{-1}{2\sigma^2}(t_n-\mu)^2 + \frac{-1}{2}\ln(2\pi\sigma^2)\bigg]$$

$$\ln p(t|w, \beta) = \frac{-1}{2\sigma^2} \sum^{N}_{n=1} \bigg[ (t_n-\mu)^2 \bigg] + \frac{-N}{2}\ln(2\pi\sigma^2)$$

$$\ln p(t|w, \beta) = -\frac{1}{2\sigma^2}\sum^{N}_{n=1} \bigg[ (t_n-\mu)^2 \bigg] - \frac{N}{2}\ln(\sigma^2) - \frac{N}{2}\ln(2\pi)$$

Substituting the precision $\beta = 1/\sigma^2$ and the model mean $\mu = w^T\phi(x_n)$:

$$\ln p(t|w, \beta) = -\beta \bigg(\frac{1}{2}\sum^{N}_{n=1} (t_n-w^T\phi(x_n))^2 \bigg) + \frac{N}{2}\ln(\beta) - \frac{N}{2}\ln(2\pi)$$

$$E(w) = \frac{1}{2}\sum^{N}_{n=1} (t_n-w^T\phi(x_n))^2$$

$$\ln p(t|w, \beta) = -\beta E(w) + \frac{N}{2}\ln(\beta) - \frac{N}{2}\ln(2\pi)$$
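
As a sanity check on this rearrangement, here is a minimal numerical sketch (the made-up data, the polynomial basis, and the use of SciPy are arbitrary assumptions, not from the book): the sum of per-point Gaussian log densities matches the closed form above.

```python
import numpy as np
from scipy.stats import norm

# Minimal check of the rearranged log-likelihood on made-up data.
rng = np.random.default_rng(0)
N, M, beta = 50, 4, 4.0                      # beta = 1/sigma^2 is the noise precision
x = rng.uniform(-1, 1, N)
Phi = np.vander(x, M)                        # rows are phi(x_n)^T (polynomial basis, an assumption)
w = rng.normal(size=M)
t = Phi @ w + rng.normal(scale=beta**-0.5, size=N)

# Left-hand side: sum of per-point Gaussian log densities with mean w^T phi(x_n)
lhs = norm.logpdf(t, loc=Phi @ w, scale=beta**-0.5).sum()

# Right-hand side: -beta * E(w) + (N/2) ln(beta) - (N/2) ln(2*pi)
E = 0.5 * np.sum((t - Phi @ w) ** 2)
rhs = -beta * E + 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi)

print(np.isclose(lhs, rhs))                  # True
```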

Now I want to maximize the likelihood with respect to the weight coefficients $w$. To do this, take the gradient with respect to $w$ and find the value of $w$ at which the gradient is zero; this maximizes the likelihood.

$$\nabla_w \bigg( -\beta E(w) + \frac{N}{2}ln(\beta) - \frac{N}{2}\ln(2\pi) \bigg) = 0$$

$$-\beta~\nabla_w E(w) = 0 $$

$$-\frac{\beta}{2}\sum^{N}_{n=1} \nabla_w \bigg((t_n-w^T\phi(x_n))^2 \bigg)=0$$

$$\beta\sum^{N}_{n=1} \nabla_w \bigg((t_n-w^T\phi(x_n))^2 \bigg)=0$$

$$\beta\sum^{N}_{n=1} 2(t_n-w^T\phi(x_n)) \nabla_w \bigg(t_n-w^T\phi(x_n) \bigg) =0$$

Since the right-hand side is zero, the constant factor of $2$ and the minus sign from $\nabla_w \big(t_n-w^T\phi(x_n)\big) = -\nabla_w \big(w^T\phi(x_n)\big)$ can be dropped:

$$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \nabla_w \bigg(w^T\phi(x_n) \bigg) =0$$

Here's where I get confused... the book says the result should be:

$$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \ \phi(x_n)^T = 0$$
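
Numerically, the book's expression does agree with a finite-difference gradient of the log-likelihood. A minimal sketch, assuming made-up data and a polynomial basis (NumPy 1-D arrays don't distinguish row from column vectors, so this only checks the values, not the row/column question):

```python
import numpy as np

# Finite-difference check that the gradient of ln p(t|w,beta) with respect to w
# has components beta * sum_n (t_n - w^T phi(x_n)) * phi_i(x_n).
rng = np.random.default_rng(1)
N, M, beta = 50, 4, 4.0
Phi = np.vander(rng.uniform(-1, 1, N), M)        # rows are phi(x_n)^T
t = rng.normal(size=N)
w = rng.normal(size=M)

def log_lik(w):
    E = 0.5 * np.sum((t - Phi @ w) ** 2)
    return -beta * E + 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi)

analytic = beta * Phi.T @ (t - Phi @ w)          # the book's expression, stacked over i
eps = 1e-6
numeric = np.array([(log_lik(w + eps * np.eye(M)[i]) - log_lik(w - eps * np.eye(M)[i]))
                    / (2 * eps) for i in range(M)])
print(np.allclose(analytic, numeric))            # True
```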

How do you know that the gradient in the previous step should produce the transpose, $\phi(x_n)^T$?

$$\beta\sum^{N}_{n=1} \bigg(t_n\phi(x_n)^T-w^T\phi(x_n)\phi(x_n)^T \bigg) =0$$

$$\sum^{N}_{n=1} \bigg(t_n\phi(x_n)^T-w^T\phi(x_n)\phi(x_n)^T \bigg) =0$$

$$\bigg(\sum^{N}_{n=1} t_n\phi(x_n)^T\bigg) - w^T \bigg( \sum^{N}_{n=1} \phi(x_n)\phi(x_n)^T \bigg) =0$$

I'm not sure about this step... but writing $\Phi$ for the design matrix whose $n$-th row is $\phi(x_n)^T$, it should reduce to:

$$\Phi^T t = (\Phi^T \Phi)\, w$$

$$w_{ML}=(\Phi^T \Phi)^{-1} \Phi^T t$$
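
A minimal sketch of this final result on made-up data (polynomial basis assumed): the normal-equation solution, the Moore-Penrose pseudo-inverse, and NumPy's least-squares solver all agree.

```python
import numpy as np

# The pseudo-inverse solution vs. NumPy's built-in least-squares solver.
rng = np.random.default_rng(2)
N, M = 50, 4
Phi = np.vander(rng.uniform(-1, 1, N), M)        # design matrix, rows phi(x_n)^T
t = rng.normal(size=N)

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)   # (Phi^T Phi)^{-1} Phi^T t
w_pinv = np.linalg.pinv(Phi) @ t                 # Moore-Penrose pseudo-inverse
w_lstsq = np.linalg.lstsq(Phi, t, rcond=None)[0]

print(np.allclose(w_ml, w_pinv), np.allclose(w_ml, w_lstsq))   # True True
```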

There are 2 answers below.

Answer 1:

It's an abuse of notation, because a derivative is taken with respect to scalar arguments...

It's really taking the partial derivative with respect to each component $w_i$ (index/tensor notation).

change:

$\nabla_w \bigg[w^T\phi\bigg]$

to:

$\frac{\partial}{\partial w_i} \bigg[w^T\phi\bigg]$

Then we can convert the inner product to index (tensor) notation, with summation over the repeated index $j$ implied:

$\frac{\partial}{\partial w_i} \bigg[w_j \phi_j \bigg] = \frac{\partial}{\partial w_i} \bigg[w_j \bigg]\phi_j = \delta_{ij} \phi_j = \phi_i$
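
A quick numerical check of this identity, with an arbitrary made-up $w$ and $\phi$: each partial derivative of the inner product is just the corresponding component of $\phi$.

```python
import numpy as np

# Check that d(w^T phi)/dw_i = phi_i, component by component.
rng = np.random.default_rng(3)
M = 5
w, phi = rng.normal(size=M), rng.normal(size=M)

eps = 1e-6
partials = np.array([((w + eps * np.eye(M)[i]) @ phi - (w - eps * np.eye(M)[i]) @ phi)
                     / (2 * eps) for i in range(M)])
print(np.allclose(partials, phi))                # True
```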


change:

$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \nabla_w \bigg(w^T\phi(x_n) \bigg) =0$

to:

$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \frac{\partial}{\partial w_i} \bigg(w^T\phi(x_n) \bigg) =0$

$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi_i(x_n) =0$

So now I have one equation for each component $i = 1, \dots, M$ (where $M$ is the number of basis functions):

$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi_1(x_n) =0$

$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi_2(x_n) =0$

$...$

$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi_M(x_n) =0$

Converting the components $\phi_i$ back to vector form (why it's a row rather than a column vector at this point, I have no idea...), $\phi^T$ is now a row vector:

$\beta\sum^{N}_{n=1} (t_n-w^T\phi(x_n)) \phi^T(x_n) =0$
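
To see that these componentwise equations stack into the matrix form used earlier, here is a small sketch on made-up data: the $M$ scalar sums are exactly the entries of $\Phi^T(t - \Phi w)$, and whether you then write that vector as a row or a column is only a layout convention.

```python
import numpy as np

# The M componentwise equations are the entries of Phi^T (t - Phi w).
rng = np.random.default_rng(4)
N, M = 50, 4
Phi = np.vander(rng.uniform(-1, 1, N), M)        # rows are phi(x_n)^T
t, w = rng.normal(size=N), rng.normal(size=M)

componentwise = np.array([np.sum((t - Phi @ w) * Phi[:, i]) for i in range(M)])
stacked = Phi.T @ (t - Phi @ w)
print(np.allclose(componentwise, stacked))       # True
```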

Answer 2:

According to the Wikipedia article on matrix calculus, the derivative of a scalar with respect to a vector is written as a row vector.


From Wikipedia:

The derivative of a scalar y by a vector

$$\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\mathsf{T}$$

is written as

$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}. $$


thus:

$$\mathbf{\phi} = \begin{bmatrix} \phi_1 & \phi_2 & \cdots & \phi_n \end{bmatrix}^\mathsf{T}$$

$\frac{\partial}{\partial w} \bigg[w^T\phi\bigg] = \begin{bmatrix}\frac{\partial}{\partial w_1} \big(\sum_j w_j \phi_j\big) & \frac{\partial}{\partial w_2} \big(\sum_j w_j \phi_j\big) & \cdots & \frac{\partial}{\partial w_n} \big(\sum_j w_j \phi_j\big) \end{bmatrix} = \begin{bmatrix}\phi_1 & \phi_2 & \cdots & \phi_n \end{bmatrix} = \phi^T$
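
Finally, a small sketch (made-up data, polynomial basis assumed) showing that the row-vector and column-vector conventions lead to the same $w_{ML}$: the stationarity condition is the same set of $M$ scalar equations either way.

```python
import numpy as np

# Both layout conventions give the same maximum-likelihood weights.
rng = np.random.default_rng(5)
N, M = 50, 4
Phi = np.vander(rng.uniform(-1, 1, N), M)
t = rng.normal(size=N)

# Row-vector form:     t^T Phi = w^T (Phi^T Phi)   ->  solve the transposed system
w_row = np.linalg.solve((Phi.T @ Phi).T, t @ Phi)
# Column-vector form:  Phi^T t = (Phi^T Phi) w
w_col = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

print(np.allclose(w_row, w_col))                 # True
```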