Let $L^2\equiv L^2(\mathcal X,\mu)$ be the space of square integrable functions over the bounded domain $\mathcal X\subset \mathbb R^d$ with Lebesgue measure $\mu$, equipped with its usual norm and inner product, and let
$$\mathcal{F}_{\Theta}:= \left\{\hat{f}_{\theta} : x\in\mathcal X \mapsto W^{(L)}\circ \sigma \circ W^{(L-1)}\circ\ldots\circ \sigma \circ W^{(1)}(x)\ \middle|\ W^{(k)}(y) := A_k y + b_k,\ 1\le k\le L \right\}$$
where $\sigma(x) := \max(0,x)$ (understood pointwise when applied to a vector) is the ReLU activation function, and $\theta_k :=(A_k,b_k)$ are respectively the weight matrices and bias vectors of appropriate dimensions, belonging to some parameter space $\Theta$. In other words, $\mathcal F_\Theta$ is the family of realizations of depth-$L$ neural networks with ReLU activation over the parameter space $\Theta$. See e.g. page 6 of this paper for a more precise definition of this function space.
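For concreteness, here is a minimal numpy sketch of a single realization $\hat f_\theta \in \mathcal F_\Theta$ (depth $L = 2$, one input, three hidden units; the parameter values are arbitrary toy choices of mine, purely illustrative):

```python
import numpy as np

def relu(x):
    # sigma(x) = max(0, x), applied entrywise
    return np.maximum(0.0, x)

# Toy parameters theta = ((A_1, b_1), (A_2, b_2)), chosen arbitrarily:
# a depth-2 ReLU network R^1 -> R^1 with 3 hidden units.
A1 = np.array([[1.0], [1.0], [-1.0]]); b1 = np.array([0.0, -1.0, 0.5])
A2 = np.array([[1.0, -2.0, 0.5]]);     b2 = np.array([0.1])

def f_hat(x):
    # W^(2) ( sigma ( W^(1)(x) ) ): continuous and piecewise linear in x
    return A2 @ relu(A1 @ x + b1) + b2

print(f_hat(np.array([0.3])))  # -> [0.5]
```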
Consider now a function $f^*\in L^2$ such that $f^*$ is almost everywhere differentiable and $\nabla f^*\in L^2$ (we can assume higher regularity of $f^*$ if needed), and let $\hat f_\theta \in \mathcal F_\Theta$ be the best approximation of $f^*$ in $L^2$ i.e. $$\hat f_\theta \in \arg\min_{f\in\mathcal F_\Theta} \|f-f^*\|_{L^2}$$
Because all the functions in $\mathcal F_\Theta$ are Lipschitz, Rademacher's theorem ensures that their gradients are well defined almost everywhere (and lie in $L^2$, by boundedness of $\mathcal X$). My question is thus: can we get any bound on the quantity $\|\nabla \hat f_\theta - \nabla f^*\|_{L^2}$ in terms of $\|\hat f_\theta-f^*\|_{L^2}$? In words, is the optimal neural network approximation of $f^*$ a good approximation of its gradient as well?
Of course, such a result can't hold in full generality, because it would imply that the $L^2$ norm and the $W^{1,2}$ norm are equivalent, which is absurd. For this particular case, however, some simple experiments such as those performed in this stats.SE post suggest that the result may hold. Similarly, the authors of this paper have shown that, in the one-dimensional case, there exist neural networks which uniformly approximate a target function and its derivative simultaneously. Still, I suspect that additional assumptions will be needed for the answer to my question to be positive.
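To make the "absurd" part concrete, here is the standard counterexample (added for completeness): on $\mathcal X = (0,2\pi)$, take
$$g_n(x) := \frac{1}{n}\sin(nx), \qquad \|g_n\|_{L^2} = \frac{\sqrt\pi}{n}\longrightarrow 0, \qquad \|g_n'\|_{L^2} = \|\cos(n\,\cdot)\|_{L^2} = \sqrt\pi \ \text{ for all } n,$$
so $f^*$ and $f^* + g_n$ become arbitrarily close in $L^2$ while their derivatives stay at the fixed $L^2$ distance $\sqrt\pi$.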
I would be happy with both direct answers and relevant references.
Thanks in advance.
Update: I gave this problem some more thought because it was bothering me quite a bit, and I think I have an idea of what might be going on.
The empirical observation is the following: if you fit a (smooth) function $f$ with a neural network $\hat f_\theta \in \mathcal F_\Theta$, and the fit is "good enough" (as in, the two functions look similar when you plot their graphs), then the derivatives will also be pretty close. (I only tried functions defined on the real line, as it's easier to visualize everything, but I'm fairly sure the observation carries over to higher dimensions.)
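To reproduce the observation without actually training a network, here is a minimal numpy sketch (my own; the linked post trains real networks) that uses the continuous piecewise-linear interpolant of $f$ on a uniform grid as a stand-in for a well-trained 1D ReLU network, and compares the $L^2$ errors of the function and of its derivative:

```python
import numpy as np

# Target function and its derivative on I = [0, 2*pi]
f, fprime = np.sin, np.cos
a, b = 0.0, 2 * np.pi

# Stand-in for a trained 1D ReLU network: the continuous
# piecewise-linear interpolant of f on N uniform knots.
N = 50
knots = np.linspace(a, b, N + 1)

# Fine grid for the numerical L2 norms
x = np.linspace(a, b, 100_001)
dx = x[1] - x[0]

fhat = np.interp(x, knots, f(knots))              # piecewise-linear fit
slopes = np.diff(f(knots)) / np.diff(knots)       # slope a_i on each piece I_i
fhat_prime = slopes[np.clip(np.searchsorted(knots, x) - 1, 0, N - 1)]

err_f = np.sqrt(np.sum((fhat - f(x)) ** 2) * dx)            # ||f_hat - f||_L2
err_df = np.sqrt(np.sum((fhat_prime - fprime(x)) ** 2) * dx)  # ||f_hat' - f'||_L2
print(err_f, err_df)  # both small; the derivative error decays one order slower
```

Doubling $N$ roughly quarters `err_f` but only halves `err_df`, consistent with the derivative being one order harder to approximate ($O(h^2)$ vs $O(h)$ for linear interpolation with mesh size $h$).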
I think there are two key factors at play here: the first is that $\mathbf{(A)}$ the fit is "good enough", i.e. we need $\|\hat f_\theta-f\|_{L^2}\le \varepsilon$ for $\varepsilon$ "small enough" in order to observe $\|\nabla\hat f_\theta-\nabla f\|_{L^2}\le C(\varepsilon)$ for a non-vacuous $C(\varepsilon)$.
The second factor is that neural networks have a very specific structure: $\mathbf{(B)}$ they are continuous, piecewise-linear functions.
With that in mind, I believe it is possible to show that if $\|\nabla\hat f_\theta-\nabla f\|_{L^2}> L$ for a large constant $L$ (an abuse of notation: this $L$ is unrelated to the depth), then $\|\hat f_\theta- f\|_{L^2}> M(L)$ for some non-vanishing $M(L)$. My attempt at proving this in the 1D case is as follows:
Let $f$ and $\hat f_\theta$ be defined on a bounded interval $I\subseteq\mathbb R$, and suppose $\|\hat f_\theta'-f'\|_{L^2}^2>L$. Our goal is to lower bound $\|\hat f_\theta-f\|_{L^2}^2$. By construction of $\hat f_\theta$, there exists a finite partition $I_1,\ldots,I_N$ of $I$ into $N$ subintervals such that $$\hat f_\theta(x) = \sum_{i=1}^N (a_i x + b_i) \mathbf{1}_{I_i}(x) \ \text{ for all } x\in I$$ where the $a_i,b_i$ are scalars. By the pigeonhole principle applied to the lower bound, there exists an index $i_0\in\{1,\ldots,N\}$ such that $$\int_{I_{i_0}}(\hat f_\theta'-f')^2 = \int_{I_{i_0}} (a_{i_0} - f')^2 > L/N $$
From here, we can lower bound $\|\hat f_\theta- f\|_{L^2}$ as follows $$\begin{align*}\|\hat f_\theta- f\|_{L^2}^2 &\ge \int_{I_{i_0}} (\hat f_\theta- f)^2\\ &=\int_{I_{i_0}} (a_{i_0}x + b_{i_0}- f(x))^2 dx\\ &\approx \int_{I_{i_0}} (a_{i_0}x_0 + b_{i_0}- f(x_0) + (a_{i_0}-f'(x))(x-x_0))^2 dx\\ &= \underbrace{|I_{i_0}|(a_{i_0}x_0 + b_{i_0}-f(x_0))^2}_{(\star)} + \underbrace{\int_{I_{i_0}} ((a_{i_0}-f'(x))(x-x_0))^2 dx}_{(\star \star)}\\ &+ 2\underbrace{\int_{I_{i_0}} (a_{i_0}x_0 + b_{i_0}-f(x_0))(a_{i_0}-f'(x))(x-x_0) dx}_{(\star \star \star)} \end{align*} $$
where $x_0$ is any point in $I_{i_0}$. Now, $(\star)$ is non-negative, so it won't bother us. The term $(\star \star)$ is handled by splitting $I_{i_0}$ into an "$x$ close to $x_0$" part and an "$x$ far from $x_0$" part, which should ultimately give something like $(\star \star)\ge |I_{i_0}| L/4N$. The cross term $(\star \star \star)$ is the trickier bit: I don't see an obvious way to control its sign and magnitude without further assumptions. A possible way around it is to require that $f'$ not vary too much on $I_{i_0}$, so that $a_{i_0}-f'(x)$ has "almost constant sign" on $I_{i_0}$, and then pick $x_0$ such that $(\star \star \star)$ is non-negative. It would then follow that $$\|\hat f_\theta- f\|_{L^2}^2 \gtrsim |I_{i_0}| L/4N $$ So if the above is correct, a bound on $\|\hat f_\theta-f\|_{L^2}$ does indeed imply a bound on $\|\hat f'_\theta-f'\|_{L^2}$.
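The heuristic inequality can at least be sanity-checked numerically. The sketch below (my construction: a deliberately coarse continuous piecewise-linear interpolant of $f = \sin$ standing in for $\hat f_\theta$, so that the derivative error is sizable) computes both sides of $\|\hat f_\theta- f\|_{L^2}^2 \gtrsim |I_{i_0}| L/4N$, together with the pigeonhole step:

```python
import numpy as np

f, fprime = np.sin, np.cos
a, b = 0.0, 2 * np.pi
N = 8                                    # deliberately coarse fit
knots = np.linspace(a, b, N + 1)         # partition I_1, ..., I_N
x = np.linspace(a, b, 200_001)
dx = x[1] - x[0]

# Stand-in for f_hat: continuous piecewise-linear interpolant of f
fhat = np.interp(x, knots, f(knots))
slopes = np.diff(f(knots)) / np.diff(knots)          # slope a_i on each piece
piece = np.clip(np.searchsorted(knots, x) - 1, 0, N - 1)
fhat_prime = slopes[piece]

# L := ||f_hat' - f'||^2 and the piece I_{i0} carrying more than L/N of it
sq = (fhat_prime - fprime(x)) ** 2 * dx
L_total = sq.sum()
per_piece = np.bincount(piece, weights=sq, minlength=N)
i0 = per_piece.argmax()                  # pigeonhole: per_piece[i0] > L/N

lhs = np.sum((fhat - f(x)) ** 2) * dx    # ||f_hat - f||^2
rhs = (knots[i0 + 1] - knots[i0]) * L_total / (4 * N)
print(lhs, rhs)                          # lhs exceeds rhs in this instance
```

This is of course only one instance, not evidence for the general claim, but it shows the constants are not off in an obvious way.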
So now my question(s) would be: how can the above argument be made rigorous? The first-order Taylor expansion in particular was very handwavy. Beyond that, how should the argument be adapted to the multivariate case? Lastly, I have to mention that the above bound is very pessimistic: in practice $N$ is huge and $|I_{i_0}|$ is tiny, so is it possible to obtain a better bound? I actually believe that the smoother $f$ is, the better the bound should be. Also, I haven't used the fact that $\hat f_\theta$ is continuous, so I'm fairly confident that it's possible to do better.
Thank you for your time !