(ML) Gradient Descent Step Simplification Question for Linear Regression


I am doing the Coursera Machine Learning course, and I am struggling with gradient descent applied to linear regression.

So basically given the definition of the cost function (Mean Squared Error): $J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$
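As a concrete check, the cost $J$ for the 1-D hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ can be computed directly. This is a minimal sketch, not code from the course; all names are illustrative:

```python
def cost(theta0, theta1, xs, ys):
    """Mean squared error cost J(theta) for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    # Sum of squared residuals, scaled by 1/(2m) as in the definition of J
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
```

For a perfect fit, e.g. `cost(0, 2, [1, 2, 3], [2, 4, 6])`, the residuals vanish and the cost is 0.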

And the basic steps below (in a 1D case):

Repeat until convergence

  • $temp0 = \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
  • $temp1 = \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
  • $\theta_0 = temp0$
  • $\theta_1 = temp1$

At some point it is shown that:

  • $\theta_0 = \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i})$
  • $\theta_1 = \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right)$
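The two update rules above can be sketched in plain Python (a sketch of one iteration, with names of my own choosing, not from the course):

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One gradient descent step for 1-D linear regression.

    Implements the simultaneous update: both new parameters are computed
    from the old (theta0, theta1) before either is overwritten.
    """
    m = len(xs)
    # Residuals h_theta(x_i) - y_i under the current parameters
    errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    temp0 = theta0 - alpha * sum(errs) / m
    temp1 = theta1 - alpha * sum(e * x for e, x in zip(errs, xs)) / m
    return temp0, temp1
```

Iterating this on data generated by $y = 2x$ with a small enough learning rate drives $(\theta_0, \theta_1)$ toward $(0, 2)$.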

With a bit of explanation, but based on a single training example ($m = 1$):

  1. $\frac{\partial}{\partial \theta_j} J(\theta)=\frac{\partial}{\partial \theta_j} \frac{1}{2} (h_{\theta}(x) - y)^2$

  2. $\Leftrightarrow 2 \frac{1}{2} (h_{\theta}(x) - y) \frac{\partial}{\partial \theta_j} (h_{\theta}(x) - y) = (h_{\theta}(x) - y) \frac{\partial}{\partial \theta_j}(\sum_{i=0}^{n} \theta_i x_i - y )$

  3. $\Rightarrow (h_{\theta}(x) - y) x_j$

I don't really see how we get from step 1 to step 2. Any ideas?

I feel there is something missing, even if we substitute $h_\theta(x)=\theta_0 + \theta_1 x$. I mean, the partial derivative is taken with respect to $\theta_j$...


BEST ANSWER

This is the chain rule of differentiation being applied. If you have something like $f(g(\theta_j))$, its (ordinary) derivative with respect to $\theta_j$ is $f'(g(\theta_j))\,g'(\theta_j)$, where $f'$ and $g'$ are the derivatives of $f$ and $g$.

The same applies here, where $f(t) = \frac{1}{2}t^2$ and $g(\theta_j) = h_{\theta_0, \dots, \theta_j, \dots, \theta_n}(x)$, with all the other $\theta_{i\neq j}$ treated as fixed. It looks a little odd only because $h$ is not written explicitly as a function of $\theta$.

We have $f'(t) = t$ and $g'(\theta_j) = \frac{\partial}{\partial \theta_j} (h_\theta(x) - y)$.

Applying the chain rule:

$\frac{\partial}{\partial \theta_j} \frac{1}{2}(h_\theta(x) - y)^2 = f'(g(\theta_j))\,g'(\theta_j)\\ = g(\theta_j)\,g'(\theta_j)\\ = (h_\theta(x) - y)\frac{\partial}{\partial \theta_j} (h_\theta(x) - y)$

Finally, only the $i = j$ term of $\sum_{i=0}^{n} \theta_i x_i$ depends on $\theta_j$, so $\frac{\partial}{\partial \theta_j}(h_\theta(x) - y) = x_j$, which gives step 3.