I am studying Machine Learning, but I believe you guys should be able to help me with this!
Basically, we are given a set of training data $\{(x_1,y_1), (x_2,y_2), \ldots, (x_n, y_n)\}$, and we need to train a perceptron to fit the data as well as possible. A perceptron here is a simple model consisting of a weight vector $w$; the perceptron outputs $w^Tx$ for an input $x$.
We define an error function $E(w) = \frac{1}{2N} \sum_{d=1}^{N}(t_d - o_d)^2$, where $t_d - o_d$ is simply the difference between the ideal target value $t_d$ and our output $o_d = w^Tx_d$.
A way to minimize the error is to compute the gradient, and we obtain
$\frac{\partial E}{\partial w_i} = \frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id})$
Now, in a computer algorithm, I can, on each iteration, update each $w_i$ to $w_i - \eta \frac{\partial E}{\partial w_i}$, but the problem is that computing this is slow, as the sum ranges over the whole training set.
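To make the batch update concrete, here is a minimal sketch of one full-batch gradient-descent step for the setup above. The function name, the toy data, and the learning rate are my own illustrative choices, not anything from a specific library:

```python
import numpy as np

def batch_gd_step(w, X, t, eta=0.1):
    """One batch update: w <- w - eta * dE/dw, where
    dE/dw = (1/N) * sum_d (t_d - o_d) * (-x_d)."""
    o = X @ w                      # outputs o_d = w^T x_d for every example
    grad = -(t - o) @ X / len(t)   # full-batch gradient of E(w)
    return w - eta * grad

# Hypothetical toy data: targets consistent with w* = [1, 2]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
for _ in range(500):
    w = batch_gd_step(w, X, t)
```

Note that every call to `batch_gd_step` touches all $N$ examples, which is exactly the cost the question is complaining about.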
So, what has been invented is the LMS (Least Mean Squares) rule, which claims that
$\frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id}) \approx (t_d - o_d)(-x_{id})$
which means I can just use the current training example to perform my gradient descent.
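The single-example version of the update above can be sketched as follows; this is the stochastic (LMS-style) counterpart of the batch step, with my own hypothetical toy data and learning rate:

```python
import numpy as np

def lms_step(w, x_d, t_d, eta=0.1):
    """One LMS / stochastic update using a single example d:
    w <- w + eta * (t_d - o_d) * x_d."""
    o_d = w @ x_d
    return w + eta * (t_d - o_d) * x_d

# Hypothetical toy data: targets consistent with w* = [1, 2]
rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
for _ in range(2000):
    d = rng.integers(len(t))   # pick one training example at random
    w = lms_step(w, X[d], t[d])
```

Each step now costs $O(\text{dim}(w))$ instead of $O(N \cdot \text{dim}(w))$, which is the whole point of the approximation.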
Now, after this intro, I would like to ask for a bit more intuition and formality behind this LMS rule, and why it is a good enough approximation. I guess I would like a bit more explanation of the $\approx$ part of the above equation: when, and to what extent, does it hold? Thanks for any help.
I learned that the perceptron works on one example at a time, and that the update is not based on gradient descent but simply on the error: if you guess the example's label correctly, you do nothing; if you guess incorrectly, you move the weights paired with the wrong label down by one step and the weights paired with the correct label up by one step, where one step is your step size.
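For the binary case, the mistake-driven update described above can be sketched like this (labels $y \in \{-1, +1\}$; the data and function name are hypothetical):

```python
import numpy as np

def perceptron_epoch(w, X, y, eta=1.0):
    """One pass of the classic mistake-driven perceptron rule:
    update w only when the current guess sign(w^T x) is wrong."""
    for x_d, y_d in zip(X, y):
        if np.sign(w @ x_d) != y_d:   # wrong guess -> move w toward y_d * x_d
            w = w + eta * y_d * x_d   # correct guess -> do nothing
    return w

# Hypothetical linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.zeros(2)
for _ in range(10):
    w = perceptron_epoch(w, X, y)
```

Unlike the LMS rule in the question, this update ignores the magnitude of the error entirely; only the sign of the guess matters.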
What you just described sounds like what I learned as logistic regression.
Here is the literature that I learned it from.
http://www.cc.gatech.edu/~jeisenst/classes/cs7650_sp12/lec2.pdf
I have code for the perceptron in Python if that would be useful, but it's pretty ugly and undocumented. There is probably better open-source code around.
However, all of this could depend on the semantics of your class, so take it with a grain of salt.
PS. For a formal reason why it's not that bad to assume the summation is approximately equal to any given sample, you should probably look at statistics and the law of large numbers: the more samples you take from a distribution, the greater the probability that the samples represent the original distribution well. If this route interests you, I can post a paper where Bayes models are trained with Monte Carlo techniques.
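To make the law-of-large-numbers point concrete: the mean of $k$ randomly sampled per-example gradients approaches the full-batch gradient as $k$ grows. A small sketch on hypothetical synthetic data (all names and sizes are my own choices):

```python
import numpy as np

# Hypothetical synthetic regression data
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=10_000)
w = np.zeros(3)

# Full-batch gradient of E(w), averaged over all N examples
full_grad = -(t - X @ w) @ X / len(t)

def sampled_grad(k):
    """Mean of k per-example gradients, drawn uniformly at random."""
    idx = rng.integers(len(t), size=k)
    return -(t[idx] - X[idx] @ w) @ X[idx] / k

err_small = np.linalg.norm(sampled_grad(10) - full_grad)
err_large = np.linalg.norm(sampled_grad(5_000) - full_grad)
```

The error of the sampled estimate shrinks roughly like $1/\sqrt{k}$, which is the standard variance argument behind treating a single-example gradient as a noisy but unbiased stand-in for the full sum.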