I am doing the Coursera Machine Learning course and there is something I am struggling with: gradient descent applied to linear regression.
So basically given the definition of the cost function (Mean Squared Error): $J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$
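(A quick numerical sketch of this cost function, on a made-up dataset that lies exactly on the line $y = 1 + 2x$, so the cost at the true parameters is zero; the data and $\theta$ values are illustrative only:)

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error J(theta) = 1/(2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(x)
    h = theta0 + theta1 * x          # hypothesis h_theta(x) = theta0 + theta1 * x
    return np.sum((h - y) ** 2) / (2 * m)

# Toy data lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

print(cost(1.0, 2.0, x, y))   # zero at the true parameters
print(cost(0.0, 0.0, x, y))   # positive anywhere else
```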
And the basic steps below (in a 1D case):
Repeat until convergence
- $temp0 = \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
- $temp1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
- $\theta_0 = temp0$
- $\theta_1 = temp1$
At some point it is shown that:
- $\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i})$
- $\theta_1 = \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right)$
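(In code, the two update rules above, including the simultaneous `temp0`/`temp1` assignment from the pseudocode, might look like this sketch; the data, learning rate and iteration count are made up:)

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        err = (theta0 + theta1 * x) - y               # h_theta(x_i) - y_i for all i
        temp0 = theta0 - alpha * np.sum(err) / m      # update using dJ/dtheta0
        temp1 = theta1 - alpha * np.sum(err * x) / m  # update using dJ/dtheta1
        theta0, theta1 = temp0, temp1                 # simultaneous update
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                  # data on the line y = 1 + 2x
print(gradient_descent(x, y))      # converges to roughly (1, 2)
```

Note that both `temp` values are computed from the *old* $\theta_0, \theta_1$ before either is overwritten, exactly as the pseudocode requires.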
With a bit of explanation, but based on a single training example ($m = 1$):
$\frac{\partial}{\partial \theta_j} J(\theta)=\frac{\partial}{\partial \theta_j} \frac{1}{2} (h_{\theta}(x) - y)^2$
$\Leftrightarrow 2 \frac{1}{2} (h_{\theta}(x) - y) \frac{\partial}{\partial \theta_j} (h_{\theta}(x) - y) = (h_{\theta}(x) - y) \frac{\partial}{\partial \theta_j}(\sum_{i=0}^{n} \theta_i x_i - y )$
$\Rightarrow (h_{\theta}(x) - y) x_j$
I don't really see how we get from step 1 to step 2. Any ideas?
I feel there is something missing, even if we substitute $h_\theta(x)=\theta_0 + \theta_1 x$. I mean, the partial derivative is taken with respect to $\theta_j$...
This is the chain rule of differentiation being applied. If you have something like $f(g(\theta_j))$, its (ordinary) derivative with respect to $\theta_j$ is $f'(g(\theta_j))g'(\theta_j)$, where $f'$ and $g'$ are the derivatives of $f$ and $g$.
The same applies here, where $f(t) = \frac{1}{2}t^2$ and $g(\theta_j) = h_{\theta_0, \dots, \theta_j, \dots, \theta_n}(x) - y$, with all the other parameters $\theta_{i\neq j}$ treated as fixed constants. It's just a little bit weird since $h$ is not written explicitly as a function of $\theta$.
We have $f'(t) = t$ and $g'(\theta_j) =\frac{\partial}{\partial \theta_j} (h_\theta(x) - y)$
Applying the chain rule:
$\frac{\partial}{\partial \theta_j} \frac{1}{2}(h_\theta(x) - y)^2 = f'(g(\theta_j))g'(\theta_j)\\ = g(\theta_j)g'(\theta_j)\\ = (h_\theta(x) - y)\frac{\partial}{\partial \theta_j} (h_\theta(x) - y)$