Problems with vector vector derivative in optimization

132 Views Asked by Bumbble Comm At 06 Apr 2026 - 4:14

I have a loss function of the followoing form:

$L(\mathbf{a}) = \|\mathbf{b} - \mathbf{a}\|_2^2$

Where, $\mathbf{a}$ and $\mathbf{b}$ are vectors of dimension $d\times 1$. I need to calculate $\frac{\partial L}{\partial \mathbf{a}}$

If I am correct $\frac{\partial L}{\partial \mathbf{a}} = -2(\mathbf{b} - \mathbf{a})^T$ (I am not sure)

I have two questions:

What is derivation of $\frac{\partial L}{\partial \mathbf{a}}$
I know that $\frac{\partial L}{\partial \mathbf{a}}$ is a vector of dimension $1\times d$ instead of $d\times 1$ ($L$ is a scalar and $\mathbf{a}$ is a vector). In that case how can I update $\mathbf{a}$ based on gradient descent (because of different dimensionality)? The update rule is: $\mathbf{a} = \mathbf{a} - \beta \frac{\partial L}{\partial \mathbf{a}}$ ? ($\beta$ is the learning rate)

There are 1 best solutions below

Bumbble Comm On 22 Jan 2015 - 1:44 BEST ANSWER

You need the Matrix Cookbook. But honestly, it is not that difficult to derive this particular case by looking at each element: $$\frac{\partial L}{\partial a_i} = \frac{\partial}{\partial a_i}\sum_{j=1}^n(b_j-a_j)^2 = 2(b_i-a_i)\cdot -1 = -2(b_i-a_i).$$ From here, you simply have $$\frac{\partial L}{\partial \mathbf{a}} = \begin{bmatrix} \frac{\partial L}{\partial a_1} & \frac{\partial L}{\partial a_2} & \dots & \frac{\partial L}{\partial a_d} \end{bmatrix}=-2(b-a)^T.$$So on this point you are correct!
The actual update rule is $\mathbf{a} = \mathbf{a} - \beta \nabla L(\mathbf{a})$, where $\nabla L$ is the gradient of $L$. The gradient is the transpose of this quantity $\frac{\partial L}{\partial \mathbf{a}}$ above. The above quantity is sometimes called the Jacobian, but Jacobians are usually considered for multi-valued functions $f:\mathbb{R}^n\rightarrow\mathbb{R}^m$, where there is quite a material difference between an $m\times n$ matrix instead of an $n\times m$ matrix. For scalar valued functions, people are often a bit looser with the association between the gradient and the Jacobian, and often use them interchangeably, even if they shouldn't.