I am trying to calculate the derivatives of the loss w.r.t. the weights, bias, and input for a single-layer neural network whose loss function is mean squared error. The derivation is as follows:
$\bullet~$ Let the weight column vector, input data, bias, and output column vector be
$\mathbf{W} \in \mathbb{R}^n$, $\mathbf{X} \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}$, $\mathbf{Y} \in \mathbb{R}^m$
$\bullet~$Let $\mathbf{Z} = \mathbf{X}\times\mathbf{W}+b$ be the linear transformation (the scalar bias $b$ is added to every component)
$\bullet~$$\hat{\mathbf{Y}} = \max(0,\mathbf{Z})$ be the ReLU activation (applied elementwise)
$\bullet~$$L = (\hat{\mathbf{Y}} - \mathbf{Y})^T(\hat{\mathbf{Y}} - \mathbf{Y})/\left|\mathbf{Y}\right|$ be the mean squared error, where $\left|\mathbf{Y}\right| = m$ is the number of samples. Then \begin{align*} &\frac{\partial L}{\partial \hat{\mathbf{Y}}} = \frac{2}{\left|\mathbf{Y}\right|}(\hat{\mathbf{Y}} - \mathbf{Y})^T \in \mathbb{R}^{1\times m}\\ &\frac{\partial \hat{\mathbf{Y}}}{\partial \mathbf{Z}} = \text{diag}\bigg(\frac{\partial \hat{Y}_1}{\partial Z_1},\dots ,\frac{\partial \hat{Y}_m}{\partial Z_m}\bigg) \in \mathbb{R}^{m\times m},~ \text{ where } \frac{\partial \hat{Y}_i}{\partial Z_i} = \begin{cases} 0 & \text{if } Z_i \leqslant 0 \\ 1 & \text{otherwise} \end{cases} \\ &\frac{\partial L}{\partial \mathbf{Z}} = \frac{2}{\left|\mathbf{Y}\right|}(\hat{\mathbf{Y}} - \mathbf{Y})^T\times \text{diag}\bigg(\frac{\partial \hat{Y}_1}{\partial Z_1},\dots ,\frac{\partial \hat{Y}_m}{\partial Z_m}\bigg) \in \mathbb{R}^{1\times m} \end{align*}
$\blacksquare~$For the weights: $$\frac{\partial \mathbf{Z}}{\partial \mathbf{W}} = \mathbf{X} \in \mathbb{R}^{m\times n}$$ Hence, $$\frac{\partial L}{\partial \mathbf{W}} = \frac{2}{\left|\mathbf{Y}\right|}(\hat{\mathbf{Y}} - \mathbf{Y})^T\times \text{diag}\bigg(\frac{\partial \hat{Y}_1}{\partial Z_1},\dots ,\frac{\partial \hat{Y}_m}{\partial Z_m}\bigg) \times \mathbf{X} \in \mathbb{R}^{1\times n}$$ In order to make the dimensions of $\dfrac{\partial L}{\partial \mathbf{W}}$ the same as those of $\mathbf{W}$, we take the transpose of the above equation, which makes the RHS $$\mathbf{X}^T\times \text{diag}\bigg(\frac{\partial \hat{Y}_1}{\partial Z_1},\dots ,\frac{\partial \hat{Y}_m}{\partial Z_m}\bigg)^T\times \frac{2}{\left|\mathbf{Y}\right|}(\hat{\mathbf{Y}} - \mathbf{Y}) \in \mathbb{R}^n$$ My first question is: $\color{blue}{\text{Is the above derivation correct? Or am I missing something?}}$
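For what it's worth, the weight gradient above can be sanity-checked against central finite differences. A minimal sketch (the sizes, seed, and data below are made up, and $\left|\mathbf{Y}\right|$ is read as $m$, the number of samples):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))   # hypothetical data
W = rng.normal(size=n)
b = rng.normal()
Y = rng.normal(size=m)

def loss(W):
    Z = X @ W + b
    Yhat = np.maximum(Z, 0.0)            # ReLU
    return (Yhat - Y) @ (Yhat - Y) / m   # MSE

# Analytic gradient: X^T diag(dYhat/dZ) (2/m)(Yhat - Y)
Z = X @ W + b
Yhat = np.maximum(Z, 0.0)
grad_W = X.T @ ((Z > 0).astype(float) * (2.0 / m) * (Yhat - Y))

# Central finite differences, one coordinate of W at a time
eps = 1e-6
num_W = np.array([(loss(W + eps * e) - loss(W - eps * e)) / (2 * eps)
                  for e in np.eye(n)])
print(np.max(np.abs(grad_W - num_W)))
```

The two gradients agree to within finite-difference error (as long as no $Z_i$ sits exactly at the ReLU kink).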
$\blacksquare~$For the bias:
$$\frac{\partial \mathbf{Z}}{\partial b} = \mathbf{1} \in \mathbb{R}^m$$
Hence,
$$\frac{\partial L}{\partial b} = \frac{2}{\left|\mathbf{Y}\right|}(\hat{\mathbf{Y}} - \mathbf{Y})^T\times \text{diag}\bigg(\frac{\partial \hat{Y}_1}{\partial Z_1},\dots ,\frac{\partial \hat{Y}_m}{\partial Z_m}\bigg) \times \mathbf{1} \in \mathbb{R}^{1\times 1}$$
My second question is: $\color{blue}{\text{Is the above expression correct?}}$
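As with the weights, the bias expression can be checked numerically. A sketch with made-up data, again reading $\left|\mathbf{Y}\right|$ as $m$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))   # hypothetical data
W = rng.normal(size=n)
b = rng.normal()
Y = rng.normal(size=m)

def loss(b):
    Yhat = np.maximum(X @ W + b, 0.0)
    return (Yhat - Y) @ (Yhat - Y) / m

# Analytic: (2/m)(Yhat - Y)^T diag(dYhat/dZ) 1  -- a scalar
Z = X @ W + b
Yhat = np.maximum(Z, 0.0)
grad_b = (2.0 / m) * np.sum((Z > 0) * (Yhat - Y))

# Central finite difference in the scalar b
eps = 1e-6
num_b = (loss(b + eps) - loss(b - eps)) / (2 * eps)
print(abs(grad_b - num_b))
```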
$\blacksquare~$For the data:
$$\frac{\partial \mathbf{Z}}{\partial \mathbf{X}} = \mathbf{W} \in \mathbb{R}^n$$
Hence,
$$\frac{\partial L}{\partial \mathbf{X}} = \frac{2}{\left|\mathbf{Y}\right|}(\hat{\mathbf{Y}} - \mathbf{Y})^T\times \text{diag}\bigg(\frac{\partial \hat{Y}_1}{\partial Z_1},\dots ,\frac{\partial \hat{Y}_m}{\partial Z_m}\bigg) \times \mathbf{W}$$ $\color{red}{\text{Which is not correct because the dimensions do not match: the left factor is }1\times m\text{, but }\mathbf{W}\in\mathbb{R}^{n}.}$
The only way this will work is as follows: $$\frac{\partial L}{\partial \mathbf{X}} = \mathbf{W}\times \frac{2}{\left|\mathbf{Y}\right|}(\hat{\mathbf{Y}} - \mathbf{Y})^T\times \text{diag}\bigg(\frac{\partial \hat{Y}_1}{\partial Z_1},\dots ,\frac{\partial \hat{Y}_m}{\partial Z_m}\bigg)$$ $\color{magenta}{\text{This seems to be just wrong to me.}}$ Can you please help me understand what is going wrong here?
Thanks!
You were doing pretty well until you got to $\frac{\partial{\cal L}}{\partial X}$.
The problem is that $\frac{\partial z}{\partial X}$ is not a matrix but rather a third-order tensor!
The simplest way to avoid such tensors in matrix calculus is to use differentials.
First, a bit of notation: $$\eqalign{ z &= Xw + {\tt1}\beta \\ dz &= dX\,w \qquad&({\rm the\,differential\,of\,}z) \\ {\cal H}(z_k) &= \begin{cases}1\quad{\rm if}\quad z_k>0\\0\quad{\rm otherwise} \end{cases} \qquad&({\rm Heaviside\,step\,function}) \\ h &= {\cal H}(z) \qquad&({\rm apply\,the\,function\,elementwise}) \\ H &= {\rm Diag}(h) \qquad&({\rm diagonal\,\{{\tt0},\!{\tt1}\}\,matrix}) \\ A:B &= {\rm Tr}(A^TB) \qquad&({\rm Frobenius\,product}) \\ \\ }$$ The Heaviside function affords a more succinct way to write one of the earlier gradients:
$$\eqalign{ \frac{\partial\hat y}{\partial z} &= H }$$ Next rewrite one of the previously calculated gradients in differential form and then perform the change of variables $z\to X$:
$$\eqalign{ d{\cal L} &= \left(\frac{\partial{\cal L}}{\partial z}\right):dz \\ &= 2\|y\|^{-1}H(\hat y-y):dz \\ &= 2\|y\|^{-1}H(\hat y-y):dX\,w \\ &= 2\|y\|^{-1}H(\hat y-y)w^T:dX \\ \frac{\partial{\cal L}}{\partial X} &= 2\|y\|^{-1}H(\hat y-y)w^T \\ }$$ And now the dimensions work out perfectly (although it appears that your preferred layout convention is the transpose of this).
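This corrected gradient is easy to confirm numerically by perturbing each entry of $X$. A sketch with made-up data (the normalizer is read as the sample count $m$, matching the question's MSE):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.normal(size=(m, n))   # hypothetical data
w = rng.normal(size=n)
beta = rng.normal()
y = rng.normal(size=m)

def loss(X):
    yhat = np.maximum(X @ w + beta, 0.0)
    return (yhat - y) @ (yhat - y) / m

z = X @ w + beta
yhat = np.maximum(z, 0.0)
h = (z > 0).astype(float)
# dL/dX = (2/m) H (yhat - y) w^T  -- an m x n matrix, same shape as X
grad_X = (2.0 / m) * np.outer(h * (yhat - y), w)

# Central finite difference on every entry of X
eps = 1e-6
num_X = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_X[i, j] = (loss(X + E) - loss(X - E)) / (2 * eps)
print(np.max(np.abs(grad_X - num_X)))
```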
The key is that the differential of a matrix is just another matrix and obeys all of the rules of matrix algebra. This is simply not true for tensors.
Not only that, but it's impossible to write tensor expressions unless/until you learn index notation.
Update
This update is to clear up some questions in the comments. Here is a list of the sizes of the various variables and products which occur in the solution $$\eqalign{ \beta &\in {\mathbb R}^{1\times 1} \\ w &\in {\mathbb R}^{n\times 1} \\ h,y,\hat y,z &\in {\mathbb R}^{m\times 1} \\ X &\in {\mathbb R}^{m\times n} \\ H &\in {\mathbb R}^{m\times m} \\ Xw,\,Hy &\in {\mathbb R}^{m\times 1} \\ Hyw^T &\in {\mathbb R}^{m\times n} \\ }$$ The properties of the trace function permit the terms in a Frobenius product $(:)$ to be rearranged in a number of equivalent ways, e.g. $$\eqalign{ &A:B = B:A = B^T:A^T \\ &A:BC = AC^T:B = B^TA:C = etc \\ }$$ Note that the matrix on each side of the product symbol (i.e. the colon) is exactly the same size. This is the same requirement as for the Hadamard product. In fact, the Frobenius product can be defined as a Hadamard product $(\odot)$ followed by summation. $$\eqalign{ A:B &= \sum_i\sum_j (A\odot B)_{ij} \\ }$$ Finally, a gradient and a differential are two ways of conveying the same information $$\eqalign{ df = G:dX\qquad\iff\qquad G=\left(\frac{\partial f}{\partial X}\right)\\ \\ }$$
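The rearrangement rules for $(:)$ are straightforward to verify numerically. A quick sketch with arbitrary (made-up) matrices of compatible sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(2, 3))

def frob(P, Q):
    """Frobenius product  P:Q = Tr(P^T Q); P and Q must be the same size."""
    return np.trace(P.T @ Q)

lhs = frob(A, B @ C)        # A : BC
r1  = frob(A @ C.T, B)      # AC^T : B
r2  = frob(B.T @ A, C)      # B^T A : C
had = np.sum(A * (B @ C))   # Hadamard product followed by summation
print(lhs, r1, r2, had)
```

All four numbers coincide, since each is just $\mathrm{Tr}(A^TBC)$ written with the trace cycled differently.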
Update #2
Here are the differentials of $z$. $$\eqalign{ z &= Xw + {\tt1}\beta \\ dz &= dX\,w \quad&({\rm wrt\,}X) \\ dz &= X\,dw \quad&({\rm wrt\,}w) \\ dz &= {\tt1}\,d\beta\quad&({\rm wrt\,}\beta) \\ }$$ This post has already established that $$\eqalign{ \hat y &= \max(z,0) \\ d\hat y &= H\,dz \\ \frac{\partial\hat y}{\partial z} &= H \;=\; H^T \quad ({\rm it's\,symmetric}) \\ }$$ Let's calculate $\frac{\partial{\cal L}}{\partial\hat y}$ $$\eqalign{ {\cal L} &= \|y\|^{-1}(\hat y-y):(\hat y-y) \\ d{\cal L} &= 2\|y\|^{-1}(\hat y-y):d\hat y \\ \frac{\partial{\cal L}}{\partial\hat y} &= 2\|y\|^{-1}(\hat y-y) \\ }$$ Substituting $\,d\hat y=H dz\,$ yields $$\eqalign{ d{\cal L} &= 2\|y\|^{-1}(\hat y-y):H\,dz \\ &= 2\|y\|^{-1}H^T(\hat y-y):dz \\ &= 2\|y\|^{-1}H(\hat y-y):dz \\ \frac{\partial{\cal L}}{\partial z} &= 2\|y\|^{-1}H(\hat y-y) \\ }$$ The other gradients are obtained by substituting $dz$ with the appropriate differential, e.g. $$\eqalign{ d{\cal L} &= 2\|y\|^{-1}H(\hat y-y):dz \\ &= 2\|y\|^{-1}H(\hat y-y):{\tt1}\,d\beta \\ &= 2\|y\|^{-1}{\tt1}^TH(\hat y-y):d\beta \\ \frac{\partial{\cal L}}{\partial\beta} &= 2\|y\|^{-1}{\tt1}^TH(\hat y-y) \\ }$$
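The intermediate gradient $\frac{\partial{\cal L}}{\partial z}$ above can likewise be checked by perturbing $z$ directly. A sketch with made-up data, reading the normalizer as $m$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
z = rng.normal(size=m)   # hypothetical pre-activation values
y = rng.normal(size=m)

def loss(z):
    yhat = np.maximum(z, 0.0)
    return (yhat - y) @ (yhat - y) / m

yhat = np.maximum(z, 0.0)
h = (z > 0).astype(float)
grad_z = (2.0 / m) * h * (yhat - y)   # 2 m^{-1} H (yhat - y)

# Central finite differences, one coordinate of z at a time
eps = 1e-6
num_z = np.array([(loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
                  for e in np.eye(m)])
print(np.max(np.abs(grad_z - num_z)))
```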