I'd appreciate your help in confirming the following calculation of $\frac {\partial L} {\partial W}$, or in pointing out any errors in it.
Let
$$ L := \frac 1 2 \| \vec{1}^T \sigma \left( W X \right) - \sigma \left( e_1^T X \right) \|_F^2, $$
where $ X \in \mathbb{R}^{(d \times n)}, W \in \mathbb{R}^{(d \times d)}, \vec{1} \in \mathbb{R}^{d} $, $e_1$ is the first column of the identity matrix $I_d$, and $\sigma$ is the element-wise ReLU (that is, $\sigma \left( s \right)_i = \max \left( 0, s_i \right)$ for a vector $s$, and $\sigma \left( S \right)_{ij} = \max \left( 0, S_{ij} \right)$ for a matrix $S$).
I tried to follow the answer in this post and arrived at the expression below.
$$ \frac {\partial L} {\partial W} = \left\{ \sigma' \left( WX \right) \odot \vec{1} \left[ \vec{1}^T \sigma \left( W X \right) - \sigma \left( e_1^T X \right) \right] \right\} X^T $$
Please confirm whether it is mathematically correct.
Thanks.
$ \def\o{{\tt1}}\def\p{\partial} \def\E{{\cal E}} \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\vec#1{\operatorname{vec}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\s#1{\sigma\LR{#1}} \def\t#1{\theta\LR{#1}} \def\c#1{\color{red}{#1}} $As in the linked post, let $\,\t{z}\,$ denote the Heaviside step function, let a colon denote the Frobenius inner product, $A:B = \trace{A^TB}$, and define the variables
$$\eqalign{
H &= \t{WX},\qquad S = \s{WX},\qquad dS = H\odot\LR{dW\,X} \\
b^T &= \s{e_1^TX},\qquad a^T = \o^TS-b^T \\
}$$
Then
$$\eqalign{
L &= \frac 12\LR{a^T:a^T} \\
dL &= a^T:da^T \\
&= a^T:\o^TdS \\
&= \o a^T:H\odot\LR{dW\,X} \\
&= H\odot{\o a^T}:\LR{dW\,X} \\
&= \LR{H\odot \o a^T}X^T:{dW} \\
\grad{L}{W} &= \LR{H\odot \o a^T}X^T \\
}$$
which matches your result, since $\sigma'$ coincides with the Heaviside step function wherever $WX$ has no zero entries.
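
If you want an independent sanity check, one option is to compare the closed-form gradient against central finite differences. The snippet below is only a sketch under my own assumptions: the helper names (`relu`, `loss`, `grad_closed_form`, `grad_finite_diff`), the sizes $d=4$, $n=6$, the random seed, and the step `eps` are all illustrative choices, not part of the question. It also assumes that no entry of $WX$ lies close enough to zero for the perturbation to cross the ReLU kink, which holds with overwhelming probability for Gaussian inputs.

```python
import numpy as np

def relu(z):
    # Element-wise ReLU, sigma(z) = max(0, z)
    return np.maximum(z, 0.0)

def loss(W, X):
    # a^T = 1^T sigma(WX) - sigma(e_1^T X); summing over rows applies 1^T
    a = relu(W @ X).sum(axis=0) - relu(X[0])
    return 0.5 * np.dot(a, a)

def grad_closed_form(W, X):
    # (H ⊙ 1 a^T) X^T, with H the Heaviside step of WX
    Z = W @ X
    H = (Z > 0).astype(float)
    a = relu(Z).sum(axis=0) - relu(X[0])
    return (H * a[None, :]) @ X.T  # broadcasting a over rows forms 1 a^T

def grad_finite_diff(W, X, eps=1e-6):
    # Central differences, one entry of W at a time
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            G[i, j] = (loss(W + E, X) - loss(W - E, X)) / (2 * eps)
    return G

rng = np.random.default_rng(0)   # illustrative seed
d, n = 4, 6                      # illustrative sizes
W = rng.standard_normal((d, d))
X = rng.standard_normal((d, n))

G_formula = grad_closed_form(W, X)
G_numeric = grad_finite_diff(W, X)
max_abs_err = np.max(np.abs(G_formula - G_numeric))
print("max abs difference:", max_abs_err)
```

Away from the kinks the loss is piecewise quadratic in $W$, so the central difference should agree with the formula up to floating-point rounding; a large discrepancy would point to an error in the derivation or to an entry of $WX$ sitting very close to zero.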