LMS Update Rule


I'm starting to study machine learning using Andrew Ng's class notes. I understand conceptually how linear regression works, but am having trouble with this equation:

$$ \theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta (x^{(i)})\right) x_j^{(i)} $$

This is of course the update rule used to obtain values of theta using gradient descent. The notes say that one of the properties of this equation is that the magnitude of the update is proportional to the error term (the expression in parentheses).

But it seems like the update would also be biased by larger values of $x_j$, since we are multiplying the error term by the value of $x_j$. To give a simple example, if we have a dataset with 5 observations, which have $x_j$ values of 1 through 5, won't the observation with $x_j = 5$ have 5 times as much 'impact' on the change in $\theta_j$ as the one with $x_j = 1$? I'm guessing that I'm misunderstanding something about the algorithm or what the variable $x_j$ represents, but if someone could explain where I'm going wrong I'd greatly appreciate it.
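To make the concern concrete, here is a quick sketch (with made-up values for $\alpha$ and the error term, held fixed so only $x_j$ varies) showing how the size of the update scales with $x_j$:

```python
import numpy as np

# Hypothetical setup: same learning rate and same error for every observation,
# so any difference in the update comes purely from x_j.
alpha = 0.1
error = 2.0                         # (y - h_theta(x)), held constant for comparison
x_j = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# LMS update to theta_j computed for each observation separately:
# alpha * (y - h_theta(x)) * x_j
updates = alpha * error * x_j
print(updates)                      # the update grows linearly with x_j
```

The observation with $x_j = 5$ really does move $\theta_j$ five times as far as the one with $x_j = 1$, which is exactly the behaviour I'm asking about.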


I'm reading through the same course and here's my take on it. You are not misunderstanding anything. An $x_j$ value 5 times larger with no change to the error will cause a steeper descent on the next iteration of the LMS method you gave. Consider the case of linear regression with a single feature, that is $\theta=(\theta_0, \theta_1) \in\mathbb{R}^{1+1}$, as we move from a hypothesis function $h_1$ to $h_2$ by adjusting $\theta_1$ a little in the plot below.

[Plot: changing the hypothesis function from $h_1$ to $h_2$]

Moving $\theta_1$ has the effect of changing the slope of the affine hypothesis function. This, by its nature, is more pronounced when point $A$ moves to point $B$ than when $C$ moves to $D$, and this is precisely what is being captured by the $x_1$ term in $\theta_1 + \alpha(y^{(i)}-h_\theta(x^{(i)}))x_1^{(i)}$. This is just the nature of adjusting the hypothesis function by pivoting on the $y$-axis: $x$ values further from the origin produce larger swings in the hypothesis function's value as $\theta$ changes. No need to worry about the $x_0$ case, as it's always $1$.
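The pivot effect is easy to see numerically. Below is a small sketch (with invented values for $\theta$ and the nudge $\delta$) comparing how much the prediction moves at a near point versus a far point when $\theta_1$ is adjusted by the same amount:

```python
# Hypothetical parameters: nudge theta_1 by delta and compare the resulting
# change in h_theta(x) at a small x versus a large x.
theta0, theta1 = 1.0, 0.5
delta = 0.1                          # small adjustment to theta_1

def h(x, t0, t1):
    """Affine hypothesis h_theta(x) = theta_0 + theta_1 * x."""
    return t0 + t1 * x

x_near, x_far = 1.0, 5.0             # like C->D versus A->B in the plot
swing_near = h(x_near, theta0, theta1 + delta) - h(x_near, theta0, theta1)
swing_far = h(x_far, theta0, theta1 + delta) - h(x_far, theta0, theta1)
print(swing_near, swing_far)         # the far point swings proportionally more
```

The same change to the slope moves the prediction at $x = 5$ five times as far as at $x = 1$, which is the geometric picture behind the $x_j^{(i)}$ factor in the update.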

Now, if we fix a training example $(x^{(i)}, y^{(i)})$ for some $i$ and (crudely) estimate the cost function to be $\hat{J}(\theta) = \frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2$ (as is done in the Widrow–Hoff/LMS algorithm you provided), then the $x_j$ term comes out mathematically from the chain rule when differentiating $\hat{J}$ since:

$$\frac{\partial \hat{J}}{\partial{\theta_j}} = \left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot\frac{\partial}{\partial{\theta_j}}\left(h_\theta(x^{(i)})-y^{(i)}\right) = \left(h_\theta(x^{(i)})-y^{(i)}\right) x_j^{(i)}$$

which is capturing precisely the behaviour in the plot.
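As a sanity check on the chain-rule result, here is a sketch (with a made-up training example and $\theta$) that compares the analytic gradient $(h_\theta(x^{(i)})-y^{(i)})\,x_j^{(i)}$ against central finite differences of $\hat{J}$:

```python
import numpy as np

# Hypothetical single training example with two features (x_0 = 1 intercept term).
x = np.array([1.0, 3.0])             # x_0 = 1, x_1 = 3
y = 2.0
theta = np.array([0.5, -0.2])

def J_hat(t):
    """Single-example squared-error cost: (1/2) * (h_theta(x) - y)^2."""
    return 0.5 * (t @ x - y) ** 2

# Analytic gradient from the chain rule: (h_theta(x) - y) * x_j
analytic = (theta @ x - y) * x

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([
    (J_hat(theta + eps * np.eye(2)[j]) - J_hat(theta - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])
print(analytic, numeric)             # the two agree to numerical precision
```

The $x_j$ factor shows up in both, confirming it is not an artifact of the LMS rule but of differentiating the squared error.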

The impact of large $x_j$ would appear to affect all gradient descent methods for linear regression that use a least-squares cost function (as the $x_j$ will always show up via the chain rule). However, for more intuition on the stochastic gradient descent linear regression method from which the Widrow–Hoff update rule arises, check out an explanation from Bernard Widrow himself: The LMS algorithm and ADALINE. Part I - The LMS algorithm