I'm going over the book Introduction to Machine Learning (Ethem Alpaydin) to brush up on my ML basics and am facing some confusion over a derivation.
More specifically:
There we used a single input linear model $g(x) = w_1 x + w_0$ where $w_1$ and $w_0$ are the parameters to learn from data. The $w_1$ and $w_0$ values should minimize: $$E(w_1, w_0\ |\ \mathcal{X}) = \frac{1}{N} \sum_{t = 1}^N \left( r^t - (w_1 x^t + w_0) \right)^2$$ Its minimum point can be calculated by taking the partial derivatives of $E$ with respect to $w_1$ and $w_0$, setting them equal to $0$, and solving for the two unknowns: $$ \begin{align} w_1 & = \dfrac{\sum_t^N x^t r^t - \bar{x}\bar{r} N}{\sum_t^N (x^t)^2 - N\bar{x}^2} \\ w_0 & = \bar{r} - w_1 \bar{x} \end{align} $$ where $\bar{x} = \sum_t^N x^t/N$ and $\bar{r} = \sum_t^N r^t/N$
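(As a sanity check, the book's closed-form solution agrees with an off-the-shelf least-squares fit; a quick NumPy sketch with synthetic data, where the variable names and the toy data are mine:)

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
r = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=N)  # true w1 = 2, w0 = 1

x_bar, r_bar = x.mean(), r.mean()

# The book's closed-form estimates
w1 = (np.sum(x * r) - N * x_bar * r_bar) / (np.sum(x**2) - N * x_bar**2)
w0 = r_bar - w1 * x_bar

# Compare against NumPy's degree-1 least-squares fit
w1_ref, w0_ref = np.polyfit(x, r, deg=1)
assert np.isclose(w1, w1_ref) and np.isclose(w0, w0_ref)
```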
Correct me if I'm wrong but I believe that:
$$ \begin{align} \frac{\partial E}{\partial w_1} & = \frac{2}{N}\sum_{t = 1}^N \left( r^t - w_1 x^t - w_0 \right) \cdot (-x^t) \end{align} $$
Setting $\frac{\partial E } {\partial w_1}$ to $0$ gives me:
$$ w_1 = \frac{\sum_t^N x^t r^t - w_0 \sum_t^N x^t}{\sum_t^N (x^t)^2} $$
which is different from the textbook's derivation. Am I understanding something incorrectly? Thanks.
Taking the derivatives with respect to $w_{0}$ and $w_{1}$ and equating them to $0$, we have: $$\frac{\partial E}{\partial w_{0}} = -\frac{2}{N}\sum_{t = 1}^{N}(r^{t} - w_{1}x^{t} - w_{0}) = 0$$ and: $$\frac{\partial E}{\partial w_{1}} = -\frac{2}{N}\sum_{t = 1}^{N}(r^{t} - w_{1}x^{t} - w_{0})x^{t} = 0$$
From the first, it follows that:
$$\sum_{t = 1}^{N}r^{t} - Nw_{0} - w_{1}\sum_{t = 1}^{N}x^{t} = 0 \Rightarrow \widehat{w}_{0} = \frac{\sum_{t = 1}^{N}r^{t}}{N} - \widehat{w}_{1}\frac{\sum_{t = 1}^{N}x^{t}}{N} = \overline{r} - \widehat{w}_{1}\overline{x}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$$
where $\widehat{w}_{0}$ and $\widehat{w}_{1}$ denote the least squares estimators of $w_{0}$ and $w_{1}$ respectively. From the second, it follows that:
$$\sum_{t = 1}^{N}x^{t}r^{t} - \widehat{w}_{0}\sum_{t = 1}^{N}x^{t} - \widehat{w}_{1}\sum_{t = 1}^{N}(x^{t})^{2} = 0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)$$
Substituting the condition derived in $(1)$ into $(2)$ and using $\sum_{t = 1}^{N}x^{t} = N\overline{x}$:
$$\sum_{t = 1}^{N}x^{t}r^{t} - N\overline{x}\overline{r} + \widehat{w}_{1}N\overline{x}^{2} - \widehat{w}_{1}\sum_{t = 1}^{N}(x^{t})^{2} = 0 \Rightarrow \widehat{w}_{1} = \frac{\sum_{t = 1}^{N}x^{t}r^{t} - N \overline{x}\overline{r}}{\sum_{t = 1}^{N}(x^{t})^{2} - N\overline{x}^{2}}$$
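You can also verify numerically that the pair $(\widehat{w}_{0}, \widehat{w}_{1})$ makes both partial derivatives vanish, i.e. satisfies the two normal equations above (a short NumPy sketch; the synthetic data is mine):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(-1, 1, size=N)
r = -0.5 * x + 3.0 + rng.normal(scale=0.2, size=N)

x_bar, r_bar = x.mean(), r.mean()
w1_hat = (np.sum(x * r) - N * x_bar * r_bar) / (np.sum(x**2) - N * x_bar**2)
w0_hat = r_bar - w1_hat * x_bar

# Both sums from the normal equations vanish at the estimates
resid = r - w1_hat * x - w0_hat
assert np.isclose(np.sum(resid), 0.0, atol=1e-9)       # dE/dw0 = 0
assert np.isclose(np.sum(resid * x), 0.0, atol=1e-9)   # dE/dw1 = 0
```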
as indeed reported in your text. Notice that, if you are already familiar with variance and covariance, multiplying both the numerator and the denominator of the last expression by $\frac{1}{N}$ (the two factors cancel) turns the numerator into the sample covariance between $x$ and $r$ and the denominator into the sample variance of $x$. Hope this clarifies.
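This covariance-over-variance reading of $\widehat{w}_{1}$ is easy to confirm numerically (a NumPy sketch; I use the biased $\frac{1}{N}$ versions of covariance and variance so that the factors match):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x = rng.normal(size=N)
r = 1.5 * x + rng.normal(size=N)

x_bar, r_bar = x.mean(), r.mean()
w1_hat = (np.sum(x * r) - N * x_bar * r_bar) / (np.sum(x**2) - N * x_bar**2)

# Biased (1/N) sample covariance and variance
cov_xr = np.mean((x - x_bar) * (r - r_bar))
var_x = np.mean((x - x_bar) ** 2)
assert np.isclose(w1_hat, cov_xr / var_x)
```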