Theoretical proof of convergence of sequential weight update procedure (Neural Networks and Machine Learning)


My question is at the bottom. (Most of the descriptive wording comes from Christopher Bishop's Neural Networks for Pattern Recognition.)

Let $w$ be the weight vector of the neural network and $E$ the error function.

According to the Robbins-Monro algorithm, the sequence $$w_{kj}^{(r+1)}=w_{kj}^{(r)}-\eta\left.\frac{\partial E}{\partial w_{kj}}\right|_{w^{(r)}}$$ will converge to a limit at which $$\frac{\partial E}{\partial w_{kj}}=0.$$
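To make the batch update concrete, here is a minimal sketch (my own toy example, not from Bishop): a hypothetical sum-of-squares error for a linear model, with the weight vector updated using the full gradient $\partial E/\partial w$ at each step.

```python
import numpy as np

# Toy setup (assumption, for illustration only): 100 patterns x_n with
# targets t_n generated from a known weight vector, and the error
# E(w) = sum_n 0.5 * (x_n . w - t_n)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 patterns, 3 weights
t = X @ np.array([1.0, -2.0, 0.5])     # targets from a known weight vector

def grad_E(w):
    # Full gradient: dE/dw = sum_n (x_n . w - t_n) x_n over the whole training set.
    return X.T @ (X @ w - t)

w = np.zeros(3)
eta = 0.001
for r in range(1000):
    w = w - eta * grad_E(w)            # w^(r+1) = w^(r) - eta * dE/dw |_{w^(r)}

print(w)                               # approaches a point where dE/dw = 0
```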

In general the error function is given by a sum of terms, each of which is calculated using one of the patterns from the training set, so that $$E=\sum_nE^n(w).$$ In applications we update the weight vector using one pattern at a time: $$w_{kj}^{(r+1)}=w_{kj}^{(r)}-\eta\frac{\partial E^n}{\partial w_{kj}}.$$
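For contrast, here is a sketch of the sequential (per-pattern) update on the same toy setup as above (again my own illustrative assumptions): each step uses only the gradient of a single pattern's error $E^n$, with a decreasing step size of the kind the Robbins-Monro conditions require.

```python
import numpy as np

# Same toy data as the batch sketch above (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5])

def grad_En(w, n):
    # Gradient of the single-pattern error E^n(w) = 0.5 * (x_n . w - t_n)^2.
    return (X[n] @ w - t[n]) * X[n]

w = np.zeros(3)
for r in range(5000):
    n = r % len(X)                     # present one pattern at a time
    eta = 1.0 / (10 + r)               # decreasing learning rate (illustrative choice)
    w = w - eta * grad_En(w, n)        # w^(r+1) = w^(r) - eta * dE^n/dw

print(w)                               # fluctuates step to step, but drifts toward dE/dw = 0
```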

My question is: why does the algorithm converge when the last formula is used? Once we use it to update $w$, the value of $w$ has changed, so I can't prove convergence simply by using $$\frac{\partial E}{\partial w_{kj}}=\sum_n \frac{\partial E^n}{\partial w_{kj}}.$$