Studying from Christopher Bishop's Pattern Recognition and Machine Learning book, I came across the following:
$$ \frac{\partial E_n}{\partial w_{ji}^{(l)}} = \frac{E_n(w_{ji}^{(l)} + \epsilon) - E_n(w_{ji}^{(l)})}{\epsilon} + O(\epsilon)$$
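Before looking for a proof, one can check the claim numerically: if the remainder really is $O(\epsilon)$, the error of the forward difference should shrink roughly linearly in $\epsilon$. Here is a minimal sketch using $\sin$ as a hypothetical stand-in for $E_n$ (any smooth function works), with its known derivative $\cos$ as the ground truth:

```python
import numpy as np

# Hypothetical smooth test function standing in for E_n(w);
# its exact derivative is known, so the remainder can be measured.
f = np.sin
df = np.cos
w = 0.7  # arbitrary evaluation point

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    forward = (f(w + eps) - f(w)) / eps   # forward-difference estimate
    err = abs(forward - df(w))            # |remainder term|
    # err / eps should stay roughly constant, consistent with O(eps)
    print(f"eps={eps:.0e}  error={err:.3e}  error/eps={err/eps:.3f}")
```

The near-constant `error/eps` ratio is exactly the bound $|f(x)| \le M|g(x)|$ in the definition below, with $g(\epsilon) = \epsilon$.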
I am looking for a formal proof of this identity using the definition of big-O notation:
$$ f(x) = O(g(x)) \text{ as } x \to a \iff \exists \delta,M\in\mathbb{R}^{++}.\ 0<|x - a|<\delta \implies|f(x)|\leq M|g(x)| $$
Although $E_n$ is, in the context of the book, the error term of a neural network, we can consider $E_n$ as a complicated non-convex function. Since
$$ E_n= \sum_i(\hat y_i - y_i)^2 $$
and
$$ \hat y_i =\sum_k w_{ik}^{(2)} h\left(\sum_{k'}w_{kk'}^{(1)}x_{k'}\right) $$
where $h$ is a non-linear differentiable function.
This would amount to proving the big-O remainder term in a first-order Taylor expansion for an arbitrary (sufficiently smooth) function $f$.
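For reference, the standard route is Taylor's theorem with the Lagrange form of the remainder: assuming $f$ is twice continuously differentiable near $w$ (an assumption not stated in the original, but reasonable for $E_n$ built from a differentiable $h$),

```latex
f(w+\epsilon) = f(w) + \epsilon f'(w) + \frac{\epsilon^2}{2} f''(\xi),
\qquad \xi \in (w,\, w+\epsilon),
```

so rearranging gives
$$ f'(w) = \frac{f(w+\epsilon)-f(w)}{\epsilon} - \frac{\epsilon}{2} f''(\xi), $$
and the last term is bounded by $M\epsilon$ with $M = \tfrac{1}{2}\sup_{|\xi - w|<\delta}|f''(\xi)|$, which matches the definition of $O(\epsilon)$ above.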