I can't actually picture in my head how the gradient w.r.t $\theta$ in this equation works geometrically. For example, for $x, y \in \mathbb{R}^d$ and $\theta \in \mathbb{R}^{d \times d}$
$$\begin{aligned} \nabla_\Theta\frac{1}{2}\Vert \Theta x - y \Vert^2_2 &= \nabla_\Theta \frac{1}{2}(\Theta x - y)^\top (\Theta x - y) \\ &= \nabla_\Theta \frac{1}{2} \left( x^\top \Theta^\top \Theta x - x^\top \Theta^\top y - y^\top \Theta x + y^\top y \right) \\ &= \nabla_\Theta \frac{1}{2}\left(\text{tr}(x^\top \Theta^\top \Theta x) - \text{tr}(x^\top \Theta^\top y) - \text{tr}(y^\top \Theta x)\right) \\ &= \nabla_\Theta \frac{1}{2}\left(\text{tr}(\Theta^\top \Theta x x^\top ) - \text{tr}(\Theta^\top yx^\top ) - \text{tr}((y^\top \Theta x)^\top)\right) \\ &= \nabla_\Theta \frac{1}{2}\left(\text{tr}(\Theta^\top \Theta x x^\top ) - 2\,\text{tr}(\Theta^\top yx^\top )\right) \\ &= \Theta x x^\top - yx^\top \\ \end{aligned}$$
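To convince myself the algebra is right, here is a quick finite-difference check with made-up values (a sketch assuming NumPy; the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
Theta = rng.standard_normal((d, d))
x = rng.standard_normal(d)
y = rng.standard_normal(d)

def loss(T):
    return 0.5 * np.sum((T @ x - y) ** 2)

# Closed-form gradient: outer product of the residual with x.
grad = np.outer(Theta @ x - y, x)

# Central finite differences, entry by entry.
eps = 1e-6
fd = np.zeros_like(Theta)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(Theta)
        E[i, j] = eps
        fd[i, j] = (loss(Theta + E) - loss(Theta - E)) / (2 * eps)

print(np.max(np.abs(grad - fd)))  # close to 0
```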
I can see intuitively that if we did something like gradient descent, $\Theta_{t+1} = \Theta_t - \lambda\nabla_\Theta$, we would follow this function towards its minimum, but I can only see this by blindly trusting the rules of matrix differentiation.
I do not have a good geometric model in my head of what $\Theta x x^\top - y x^\top$ actually means.
- $\Theta x - y$ obviously gives the unsquared error term, but why does $x^\top$ show up here?
- I can naively see that the gradient would not be in the right shape without $x^\top$, but is there any other way to interpret this?
- Is there any convenient way to see this or any resource which might cover this sort of thing?
In one dimension, the derivative of the quadratic loss for the linear model is as follows,
$$ \frac{d}{dm} \frac{1}{2}(mx - y)^2 = (mx - y)x $$
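A one-line numerical check of the scalar case, with made-up values for $m$, $x$, $y$ (a sketch, nothing here comes from the question itself):

```python
# Derivative of 0.5*(m*x - y)**2 with respect to m should be (m*x - y)*x.
m, x, y = 2.0, 3.0, 1.0
loss = lambda m_: 0.5 * (m_ * x - y) ** 2

eps = 1e-6
fd = (loss(m + eps) - loss(m - eps)) / (2 * eps)  # finite difference
analytic = (m * x - y) * x                        # (6 - 1) * 3 = 15

print(analytic, fd)  # both ~15.0
```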
This is the error $v = mx - y$ scaled by the input $x$: the larger the input, the more a change in $m$ moves the prediction, so the steeper the loss is in $m$. In the case of higher dimensions (with $\Theta \in \mathbb{R}^{2 \times 2}$), it works out to be,
$$ \begin{aligned} \nabla_\Theta\frac{1}{2}\Vert \Theta x - y \Vert^2_2 &= \nabla_\Theta \frac{1}{2}(\Theta x - y)^\top (\Theta x - y) \\ &= \nabla_\Theta \frac{1}{2} \left( x^\top \Theta^\top \Theta x - x^\top \Theta^\top y - y^\top \Theta x + y^\top y \right) \\ &= \nabla_\Theta \frac{1}{2}\left(\text{tr}(x^\top \Theta^\top \Theta x) - \text{tr}(x^\top \Theta^\top y) - \text{tr}(y^\top \Theta x)\right) \\ &= \nabla_\Theta \frac{1}{2}\left(\text{tr}(\Theta^\top \Theta x x^\top ) - \text{tr}(\Theta^\top yx^\top ) - \text{tr}((y^\top \Theta x)^\top)\right) \\ &= \nabla_\Theta \frac{1}{2}\left(\text{tr}(\Theta^\top \Theta x x^\top ) - 2\,\text{tr}(\Theta^\top yx^\top )\right) \\ &= \Theta x x^\top - yx^\top \\ &= (\Theta x - y) x^\top \\ \end{aligned} $$
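The last factoring step, $\Theta x x^\top - yx^\top = (\Theta x - y)x^\top$, is easy to verify numerically with arbitrary made-up values (a sketch assuming NumPy):

```python
import numpy as np

# Hypothetical values; any Theta, x, y of matching shapes would do.
Theta = np.array([[1.0, -2.0], [0.5, 3.0]])
x = np.array([2.0, -1.0])
y = np.array([1.0, 4.0])

g1 = Theta @ np.outer(x, x) - np.outer(y, x)  # Theta x x^T - y x^T
g2 = np.outer(Theta @ x - y, x)               # (Theta x - y) x^T

print(np.allclose(g1, g2))  # True
```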
This is more complicated to interpret because the gradient must have the same shape as $\Theta \in \mathbb{R}^{2 \times 2}$. If we think about how the matrix multiplication works in the first place,
$$ \begin{bmatrix} \theta_{1,1} & \theta_{1,2} \\ \theta_{2,1} & \theta_{2,2} \\ \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ \end{bmatrix} = \begin{bmatrix} \theta_{1,1} x_{1} + \theta_{1,2} x_{2} \\ \theta_{2,1} x_{1} + \theta_{2,2} x_{2} \\ \end{bmatrix} $$
The first row of $\Theta$ gets multiplied by each dimension of $x$, meaning that the output in the first dimension depends on both the first and second dimensions of $x$. Therefore the gradient with respect to both of these entries of $\Theta$ should likewise depend on the first and second dimensions of $x$ (proportionately).
$$ \begin{bmatrix} \theta_{1,1} x_{1} + \theta_{1,2} x_{2} - y_1 \\ \theta_{2,1} x_{1} + \theta_{2,2} x_{2} - y_2 \\ \end{bmatrix} \begin{bmatrix} x_{1} & x_{2} \\ \end{bmatrix} = \begin{bmatrix} (\theta_{1,1} x_{1} + \theta_{1,2} x_{2} - y_1)x_1 & (\theta_{1,1} x_{1} + \theta_{1,2} x_{2} - y_1)x_2\\ (\theta_{2,1} x_{1} + \theta_{2,2} x_{2} - y_2) x_1 & (\theta_{2,1} x_{1} + \theta_{2,2} x_{2} - y_2) x_2 \\ \end{bmatrix} $$
So each entry in the final gradient with respect to $\Theta$ is the row's error multiplied by the magnitude of the input dimension it was originally multiplied with, just like in the one-dimensional case. If we move $\Theta$ in the direction of the negative gradient, $\Theta_{t+1} = \Theta_t - \lambda\nabla_\Theta$, then each entry in the respective row of $\Theta$ is updated in proportion to both the magnitude of the error and the magnitude of the corresponding input dimension.
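The descent update above can be sketched in a few lines (assuming NumPy; the data, initialization, and step size $\lambda = 0.1$ are made up). Each step subtracts the outer product of the residual with $x$, and the residual shrinks geometrically:

```python
import numpy as np

# Hypothetical single data point and zero initialization.
x = np.array([1.0, -0.5])
y = np.array([2.0, 1.0])
Theta = np.zeros((2, 2))
lam = 0.1  # assumed learning rate

for _ in range(500):
    # Gradient descent step: Theta <- Theta - lam * (Theta x - y) x^T
    Theta -= lam * np.outer(Theta @ x - y, x)

print(np.linalg.norm(Theta @ x - y))  # residual near 0
```

With one data point the residual is multiplied by $(1 - \lambda \Vert x \Vert^2)$ each step, so the iteration converges whenever $\lambda \Vert x \Vert^2 < 2$.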