Partial derivative w.r.t. MSE in linear regression gives different results than textbook


I'm going over the book Introduction to Machine Learning (Ethem Alpaydin) to brush up on my ML basics and am facing some confusion over a derivation.

More specifically:

There we used a single input linear model $g(x) = w_1 x + w_0$ where $w_1$ and $w_0$ are the parameters to learn from data. The $w_1$ and $w_0$ values should minimize: $$E(w_1, w_0\ |\ \mathcal{X}) = \frac{1}{N} \sum_{t = 1}^N \left( r^t - (w_1 x^t + w_0) \right)^2$$ Its minimum point can be calculated by taking the partial derivatives of $E$ with respect to $w_1$ and $w_0$, setting them equal to $0$, and solving for the two unknowns: $$ \begin{align} w_1 & = \dfrac{\sum_t^N x^t r^t - \bar{x}\bar{r} N}{\sum_t^N (x^t)^2 - N\bar{x}^2} \\ w_0 & = \bar{r} - w_1 \bar{x} \end{align} $$ where $\bar{x} = \sum_t^N x^t/N$ and $\bar{r} = \sum_t^N r^t/N$
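As a quick sanity check of the textbook's closed form, one can compare it against a standard least-squares fitter on synthetic data (a sketch using NumPy; the data and seed here are made up for illustration):

```python
import numpy as np

# Synthetic data: r = 2.5 x - 0.7 plus a little noise
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=30)
r = 2.5 * x - 0.7 + rng.normal(scale=0.05, size=30)

N = len(x)
xbar, rbar = x.mean(), r.mean()

# Textbook closed-form solution
w1 = (np.sum(x * r) - N * xbar * rbar) / (np.sum(x**2) - N * xbar**2)
w0 = rbar - w1 * xbar

# np.polyfit minimizes the same sum of squared errors,
# so the coefficients should agree (slope first, then intercept)
w1_ref, w0_ref = np.polyfit(x, r, deg=1)
assert np.allclose([w1, w0], [w1_ref, w0_ref])
```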

Correct me if I'm wrong but I believe that:

$$ \begin{align} \frac{\partial E}{\partial w_1} & = \frac{2}{N}\sum_{t = 1}^N \left( r^t - w_1 x^t - w_0 \right) \cdot (-x^t) \end{align} $$

Setting $\frac{\partial E } {\partial w_1}$ to $0$ gives me:

$$ w_1 = \frac{\sum_t^N x^t r^t - w_0 \sum_t^N x^t}{\sum_t^N (x^t)^2} $$

which is different from the textbook's derivation. Am I understanding something incorrectly? Thanks.

1 Answer

Taking the derivatives with respect to $w_{0}$ and $w_{1}$ and equating them to $0$ we have: $$\frac{\partial E}{\partial w_{0}} = -\frac{2}{N}\sum_{t = 1}^{N}(r^{t} - w_{1}x^{t} - w_{0}) = 0$$ and: $$\frac{\partial E}{\partial w_{1}} = -\frac{2}{N}\sum_{t = 1}^{N}(r^{t} - w_{1}x^{t} - w_{0})x^{t} = 0$$

From the first, it follows that:

$$\sum_{t = 1}^{N}r^{t} - Nw_{0} - w_{1}\sum_{t = 1}^{N}x^{t} = 0 \Rightarrow \widehat{w}_{0} = \frac{\sum_{t = 1}^{N}r^{t}}{N} - \widehat{w}_{1}\frac{\sum_{t = 1}^{N}x^{t}}{N} = \overline{r} - \widehat{w}_{1}\overline{x}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$$

where with $\widehat{w}_{0}$ and $\widehat{w}_{1}$ we refer to the estimated Least Squares estimators of $w_{0}$ and $w_{1}$ respectively. From the second, instead, it follows that:

$$\sum_{t = 1}^{N}x^{t}r^{t} - \widehat{w}_{0}\sum_{t = 1}^{N}x^{t} - \widehat{w}_{1}\sum_{t = 1}^{N}(x^{t})^{2} = 0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$$

Using the condition derived in $(1)$ and $\sum_{t = 1}^{N}x^{t}$ $=$ $N\overline{x}$:

$$\sum_{t = 1}^{N}x^{t}r^{t} - N\overline{x}\overline{r} + \widehat{w}_{1}N\overline{x}^{2} - \widehat{w}_{1}\sum_{t = 1}^{N}(x^{t})^{2} = 0 \Rightarrow \widehat{w}_{1} = \frac{\sum_{t = 1}^{N}x^{t}r^{t} - N \overline{x}\overline{r}}{\sum_{t = 1}^{N}(x^{t})^{2} - N\overline{x}^{2}}$$
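Your own intermediate expression, $w_1 = \left(\sum_t x^t r^t - w_0 \sum_t x^t\right) / \sum_t (x^t)^2$, is exactly equation $(2)$ rearranged, so once the fitted $\widehat{w}_0$ is substituted into it, it yields the same slope as the textbook formula. A numerical check (a sketch on made-up NumPy data):

```python
import numpy as np

# Synthetic data: r = 3 x + 1.5 plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=20)
r = 3.0 * x + 1.5 + rng.normal(scale=0.1, size=20)

N = len(x)
xbar, rbar = x.mean(), r.mean()

# Textbook closed form
w1 = (np.sum(x * r) - N * xbar * rbar) / (np.sum(x**2) - N * xbar**2)
w0 = rbar - w1 * xbar

# The asker's expression for w1, with the fitted w0 plugged in:
# it is the same stationarity condition, just not yet solved jointly
w1_asker = (np.sum(x * r) - w0 * np.sum(x)) / np.sum(x**2)

assert np.isclose(w1, w1_asker)
```

So nothing in your derivation is wrong; it is simply one equation of a two-equation system, and the textbook reports the slope after eliminating $w_0$.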

as indeed reported in your text. If you're already familiar with the concepts of variance and covariance, notice that the numerator of the last expression is $N$ times the covariance between $x$ and $r$, while the denominator is $N$ times the variance of $x$: dividing both numerator and denominator by $N$ makes this explicit, and the two factors of $N$ cancel, giving $\widehat{w}_1 = \mathrm{Cov}(x, r)/\mathrm{Var}(x)$. Hope this clarifies.
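The covariance/variance identity is easy to verify numerically as well (a sketch on synthetic NumPy data; `bias=True` makes `np.cov` use the $1/N$ normalization, matching `np.var`'s default):

```python
import numpy as np

# Synthetic data: r = -2 x + 4 plus a little noise
rng = np.random.default_rng(1)
x = rng.normal(size=50)
r = -2.0 * x + 4.0 + rng.normal(scale=0.2, size=50)

N = len(x)
xbar, rbar = x.mean(), r.mean()

# Closed-form least-squares slope from the derivation above
w1_hat = (np.sum(x * r) - N * xbar * rbar) / (np.sum(x**2) - N * xbar**2)

# Same quantity written as Cov(x, r) / Var(x),
# both computed with the 1/N (biased) normalization
w1_cov = np.cov(x, r, bias=True)[0, 1] / np.var(x)

assert np.isclose(w1_hat, w1_cov)
```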