Lipschitz constant of gradient of feedforward regressor


I need to compute the gradient of a feedforward network with ReLU activations ($g(x) = \max(0, x)$) under the mean squared error loss. I'm using the following notation: $a^{[l]} = g(z^{[l]})$, where $g$ is the ReLU function, and $z^{[l]} = W^{[l]T} a^{[l-1]} + b^{[l]}$. Here's my work so far:

$$ \begin{aligned} E &= \frac{1}{2m} \sum\limits_{i=1}^m (\boldsymbol a^{[L]}_i - \boldsymbol y_i)^2 & \\ \nabla_W E &= \frac{1}{m} \sum\limits_{i=1}^m (\boldsymbol a^{[L]}_i - \boldsymbol y_i) \nabla_W \boldsymbol a^{[L]} & \\ &\leq \frac{1}{m} \sum\limits_{i=1}^m (\boldsymbol a^{[L]}_i - \boldsymbol y_i) \nabla_W \boldsymbol z^{[L]} & \textit{since $\boldsymbol a^{[L]} = \operatorname{ReLU}(\boldsymbol z^{[L]}) \leq \boldsymbol z^{[L]}$} \\ &= \frac{1}{m} \sum\limits_{i=1}^m (\boldsymbol a^{[L]}_i - \boldsymbol y_i) \nabla_W \left( \boldsymbol W^{[L]T} \boldsymbol a^{[L-1]} \right) & \textit{since $\nabla_W \boldsymbol b^{[L]} = 0$} \\ &= \frac{1}{m} \sum\limits_{i=1}^m (\boldsymbol a^{[L]}_i - \boldsymbol y_i) \boldsymbol a^{[L-1]T} & \\ \nabla_W^2 E &= \frac{1}{m} \left( \sum\limits_{i=1}^m \nabla_W \boldsymbol a^{[L]}_i \right) \boldsymbol a^{[L-1]T} & \\ &\leq \frac{1}{m} \left( \sum\limits_{i=1}^m \nabla_W \boldsymbol z^{[L]}_i \right) \boldsymbol a^{[L-1]T} & \\ &= \frac{1}{m} \left(\sum\limits_{i=1}^m \boldsymbol a^{[L-1]}_i \right) \boldsymbol a^{[L-1]T} & \textit{element-wise square} \\ \end{aligned} $$
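As a sanity check on the last-layer gradient formula $\nabla_{W^{[L]}} E = \frac{1}{m} \sum_i (\boldsymbol a^{[L]}_i - \boldsymbol y_i) \boldsymbol a^{[L-1]T}$, here is a minimal NumPy sketch comparing it against central finite differences. It uses a two-layer network with a *linear* output layer, for which the formula is exact (the inequality step above only bounds the ReLU case); the names `W1`, `b1`, `W2`, `b2` and the layer sizes are assumptions for this illustration, following the post's convention $z = W^T a + b$:

```python
import numpy as np

# Hypothetical 2-layer regressor: a1 = ReLU(W1^T a0 + b1), a2 = W2^T a1 + b2.
# Checks the analytical last-layer gradient (1/m) * A1^T (A2 - Y)
# against a central finite-difference approximation.
rng = np.random.default_rng(0)
m, n0, n1, n2 = 5, 3, 4, 1                  # samples, layer widths (assumed)
X = rng.normal(size=(m, n0))
Y = rng.normal(size=(m, n2))
W1 = rng.normal(size=(n0, n1)); b1 = rng.normal(size=n1)
W2 = rng.normal(size=(n1, n2)); b2 = rng.normal(size=n2)

def forward(W2_):
    A1 = np.maximum(0.0, X @ W1 + b1)       # ReLU hidden layer
    A2 = A1 @ W2_ + b2                      # linear output layer
    return A1, A2

def loss(W2_):
    _, A2 = forward(W2_)
    return 0.5 / m * np.sum((A2 - Y) ** 2)  # E = (1/2m) sum (a2 - y)^2

# Analytical gradient from the derivation (last layer only)
A1, A2 = forward(W2)
grad_analytic = (A1.T @ (A2 - Y)) / m

# Central finite-difference gradient, one weight at a time
eps = 1e-6
grad_fd = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp = W2.copy(); Wp[i, j] += eps
        Wm = W2.copy(); Wm[i, j] -= eps
        grad_fd[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

# The two gradients should agree to within finite-precision error
print(np.max(np.abs(grad_analytic - grad_fd)))
```

The same finite-difference pattern can be repeated for $W^{[1]}$ to probe whether the ReLU-bounding steps in the derivation hold away from the last layer.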

Is my math correct? I'm particularly unsure about the second-derivative part. Having bounded the second gradient, I would take the max norm of both sides to obtain a Lipschitz constant for the gradient.