How to geometrically interpret the gradient of this linear equation?


I can't actually picture in my head how the gradient w.r.t. $\Theta$ in this equation works geometrically. For example, for $x, y \in \mathbb{R}^d$ and $\Theta \in \mathbb{R}^{d \times d}$,

$$\begin{aligned} \nabla_\Theta\frac{1}{2}\Vert \Theta x - y \Vert^2_2 &= \nabla_\Theta\frac{1}{2}(\Theta x - y)^\top (\Theta x - y) \\ &= \nabla_\Theta \frac{1}{2} \left( x^\top \Theta^\top \Theta x - x^\top \Theta^\top y - y^\top \Theta x + y^\top y \right) \\ &= \nabla_\Theta \frac{1}{2}(\text{tr}(x^\top \Theta^\top \Theta x) - \text{tr}(x^\top \Theta^\top y) - \text{tr}(y^\top \Theta x)) \\ &= \nabla_\Theta \frac{1}{2}(\text{tr}(\Theta^\top \Theta x x^\top ) - \text{tr}(\Theta^\top yx^\top ) - \text{tr}((y^\top \Theta x)^\top)) \\ &= \Theta x x^\top - yx^\top \\ \end{aligned}$$

I can see intuitively that if we did something like gradient descent $\Theta_{t+1} = \Theta_t - \lambda\nabla_\Theta$, we would follow this function towards its minimum, but I can only see this by blindly trusting the rules of differentiation.

I do not have a good geometric model in my head of what $\Theta x x^\top - y x^\top$ actually means.

  • $\Theta x - y$ obviously gives the unsquared error term, but why does $x^\top$ show up here?
  • I can naively see that the gradient would not be in the right shape without $x^\top$, but is there any other way to interpret this?
  • Is there any convenient way to see this or any resource which might cover this sort of thing?
BEST ANSWER

In one dimension, the derivative of the quadratic loss from the linear model is as follows,

$$ \frac{d}{dm} \frac{1}{2}(mx - y)^2 = (mx - y)x $$
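This one-dimensional derivative is easy to verify numerically. A minimal sketch, where the values of $m$, $x$, $y$ are arbitrary illustrative choices:

```python
# Numerical check of d/dm [ (1/2)(m*x - y)^2 ] = (m*x - y)*x.
# The specific values of m, x, y below are arbitrary examples.
m, x, y = 1.5, 2.0, 1.0

analytic = (m * x - y) * x  # (mx - y)x = (3 - 1) * 2 = 4

# Independent check via a central finite difference
h = 1e-6
f = lambda m: 0.5 * (m * x - y) ** 2
numeric = (f(m + h) - f(m - h)) / (2 * h)

print(analytic, round(numeric, 4))  # 4.0 4.0
```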

This is the derivative of the squared error $v^2$ (with $v = mx - y$) evaluated at the location $x$ in the input space: the error is scaled by the very input it was multiplied with. In higher dimensions (with $\Theta \in \mathbb{R}^{2 \times 2}$), it works out to be,

$$ \begin{aligned} \nabla_\Theta\frac{1}{2}\Vert \Theta x - y \Vert^2_2 &= \nabla_\Theta\frac{1}{2}(\Theta x - y)^\top (\Theta x - y) \\ &= \nabla_\Theta \frac{1}{2} \left( x^\top \Theta^\top \Theta x - x^\top \Theta^\top y - y^\top \Theta x + y^\top y \right) \\ &= \nabla_\Theta \frac{1}{2}(\text{tr}(x^\top \Theta^\top \Theta x) - \text{tr}(x^\top \Theta^\top y) - \text{tr}(y^\top \Theta x)) \\ &= \nabla_\Theta \frac{1}{2}(\text{tr}(\Theta^\top \Theta x x^\top ) - \text{tr}(\Theta^\top yx^\top ) - \text{tr}((y^\top \Theta x)^\top)) \\ &= \Theta x x^\top - yx^\top \\ &= (\Theta x - y) x^\top \\ \end{aligned} $$

This is more complicated to interpret because $\Theta \in \mathbb{R}^{2 \times 2}$ is a matrix, and therefore the gradient must have the same shape. Consider how the matrix multiplication works in the first place,

$$ \begin{bmatrix} \theta_{1,1} & \theta_{1,2} \\ \theta_{2,1} & \theta_{2,2} \\ \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \end{bmatrix} = \begin{bmatrix} \theta_{1,1} x_{1} + \theta_{1,2} x_{2} \\ \theta_{2,1} x_{1} + \theta_{2,2} x_{2} \\ \end{bmatrix} $$

The first row of $\Theta$ gets multiplied by each dimension of $x$, meaning that the output in the first dimension depends on both the first and second dimensions of $x$. Therefore the gradient with respect to both of these entries in $\Theta$ should likewise depend (proportionately) on the first and second dimensions of $x$.

$$ \begin{bmatrix} \theta_{1,1} x_{1} + \theta_{1,2} x_{2} - y_1 \\ \theta_{2,1} x_{1} + \theta_{2,2} x_{2} - y_2 \\ \end{bmatrix} \begin{bmatrix} x_{1} & x_{2} \\ \end{bmatrix} = \begin{bmatrix} (\theta_{1,1} x_{1} + \theta_{1,2} x_{2} - y_1)x_1 & (\theta_{1,1} x_{1} + \theta_{1,2} x_{2} - y_1)x_2\\ (\theta_{2,1} x_{1} + \theta_{2,2} x_{2} - y_2) x_1 & (\theta_{2,1} x_{1} + \theta_{2,2} x_{2} - y_2) x_2 \\ \end{bmatrix} $$
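The entrywise expansion can also be confirmed numerically against a finite-difference gradient of the loss. A minimal sketch, assuming NumPy is available; all values are arbitrary examples:

```python
import numpy as np

# Check that (Θx - y)xᵀ matches an entrywise finite-difference gradient
# of L(Θ) = (1/2)||Θx - y||².  All values below are arbitrary examples.
Theta = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, -1.0])
y = np.array([0.5, 0.5])

analytic = np.outer(Theta @ x - y, x)  # (Θx - y)xᵀ

loss = lambda T: 0.5 * np.sum((T @ x - y) ** 2)
h = 1e-6
numeric = np.zeros_like(Theta)
for i in range(2):
    for j in range(2):
        E = np.zeros_like(Theta)
        E[i, j] = h  # perturb one entry of Θ at a time
        numeric[i, j] = (loss(Theta + E) - loss(Theta - E)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```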

So each term in the final gradient with respect to $\Theta$ is multiplied by the magnitude of the input dimension it was originally multiplied with, just like in the single-dimensional case. If we move $\Theta$ in the direction of the negative gradient, $\Theta_{t+1} = \Theta_t - \lambda\nabla_\Theta$, then each entry in the respective row of $\Theta$ will be adjusted in proportion to both the magnitude of the error and the magnitude of its input dimension.
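To see this update rule in action, here is a minimal gradient-descent sketch on a single $(x, y)$ pair, assuming NumPy; the step size $\lambda$ and iteration count are arbitrary illustrative choices:

```python
import numpy as np

# Gradient descent Θ_{t+1} = Θ_t - λ(Θx - y)xᵀ on a single (x, y) pair.
# The values of x, y, λ, and the step count are arbitrary examples.
x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
Theta = np.zeros((2, 2))

lam = 0.1
for _ in range(200):
    Theta -= lam * np.outer(Theta @ x - y, x)

# Each step shrinks the residual Θx - y by a factor (1 - λ‖x‖²),
# so after many steps Θx reproduces y.
print(np.linalg.norm(Theta @ x - y) < 1e-8)  # True
```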