Derivative of a cost function (Andrew Ng machine learning course)


I'm currently taking Andrew Ng's course, and in it he shows the partial derivatives of the function $\frac{1}{2m}\sum_{i=1}^{m}(H_\Theta(x^i)-y^i)^2$ with respect to both $\Theta_0$ and $\Theta_1$. But I couldn't wrap my mind around it. I would like to see a step-by-step derivation of the function for both $\Theta$s.

The hypothesis function is defined as $H_\Theta(x)=\Theta_0+\Theta_1x$

And the partial derivatives are

For $\Theta_0$

$\frac{1}{m}\sum_{i=1}^{m}(H_\Theta(x^i)-y^i)$

For $\Theta_1$

$\frac{1}{m}\sum_{i=1}^{m}(H_\Theta(x^i)-y^i)x^i$
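As a quick sanity check, these two formulas can be compared against central finite differences of the cost. Below is a small stdlib-only Python sketch; the data and parameter values are made up for illustration.

```python
# Compare the analytic partial derivatives of the cost
# J(t0, t1) = (1/2m) * sum((t0 + t1*x - y)^2)
# against central finite differences.

def cost(t0, t1, xs, ys):
    m = len(xs)
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def grad(t0, t1, xs, ys):
    m = len(xs)
    g0 = sum(t0 + t1 * x - y for x, y in zip(xs, ys)) / m          # d/dΘ0
    g1 = sum((t0 + t1 * x - y) * x for x, y in zip(xs, ys)) / m    # d/dΘ1
    return g0, g1

xs = [1.0, 2.0, 3.0]       # made-up data
ys = [2.0, 2.5, 3.5]
t0, t1 = 0.5, 0.8          # arbitrary parameter values
eps = 1e-6

num0 = (cost(t0 + eps, t1, xs, ys) - cost(t0 - eps, t1, xs, ys)) / (2 * eps)
num1 = (cost(t0, t1 + eps, xs, ys) - cost(t0, t1 - eps, xs, ys)) / (2 * eps)
g0, g1 = grad(t0, t1, xs, ys)
print(abs(num0 - g0) < 1e-6, abs(num1 - g1) < 1e-6)  # → True True
```

Since the cost is quadratic in $\Theta_0$ and $\Theta_1$, the central difference agrees with the analytic gradient up to floating-point error.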

Best answer

Suppose we have $u = x^2 + 1$ and $f(x)=u^2=(x^2+1)^2$. By the chain rule, the derivative is

$\frac{df}{dx}= \frac{df}{du} \cdot \frac{du}{dx}= 2(x^2+1) \cdot 2x$

Do the same with the machine learning problem: write $f = \frac{1}{2m}\sum_{i=1}^{m}(H_\Theta(x^i)-y^i)^2$ and focus on the inner function $u = H_\Theta(x^i)-y^i$. Then, up to the $\frac{1}{2m}$ factor,

$\frac{df}{du} = 2(H_\Theta(x^i)-y^i)$

With respect to $\Theta_0$: $\frac{du}{d\Theta_0} = 1$

With respect to $\Theta_1$: $\frac{du}{d\Theta_1} = x^i$

The factor of $2$ cancels against the $2$ in $\frac{1}{2m}$, leaving $\frac{1}{m}$.
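The one-variable warm-up above can also be checked numerically. This short Python sketch (with an arbitrary evaluation point) compares the chain-rule derivative of $f(x)=(x^2+1)^2$ with a central finite difference:

```python
# Chain rule check for f(x) = (x^2 + 1)^2, whose derivative is
# df/dx = 2*(x^2 + 1) * 2x.

def f(x):
    return (x ** 2 + 1) ** 2

def df(x):
    return 2 * (x ** 2 + 1) * 2 * x

x, eps = 1.5, 1e-6  # arbitrary point
num = (f(x + eps) - f(x - eps)) / (2 * eps)
print(abs(num - df(x)) < 1e-4)  # → True
```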


It's just chain rule:

$$\frac{d}{d\Theta_0} \frac{1}{2m} \sum_{i=1}^{m}(H_\Theta(x_i)-y_i)^2$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\frac{d}{d\Theta_0}(H_\Theta(x_i)-y_i)^2$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\frac{d}{d\Theta_0}(\Theta_0+\Theta_1x_i-y_i)^2$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\cdot 2(\Theta_0+\Theta_1x_i-y_i)\cdot\frac{d}{d\Theta_0}(\Theta_0+\Theta_1x_i-y_i)$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\cdot 2(\Theta_0+\Theta_1x_i-y_i)\cdot(1)$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\cdot 2(\Theta_0+\Theta_1x_i-y_i)$$
$$=\frac{1}{m}\sum_{i=1}^{m}(\Theta_0+\Theta_1x_i-y_i)=\frac{1}{m}\sum_{i=1}^{m}(H_\Theta(x_i)-y_i)$$

And for $\Theta_1$:

$$\frac{d}{d\Theta_1} \frac{1}{2m} \sum_{i=1}^{m}(H_\Theta(x_i)-y_i)^2$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\frac{d}{d\Theta_1}(H_\Theta(x_i)-y_i)^2$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\frac{d}{d\Theta_1}(\Theta_0+\Theta_1x_i-y_i)^2$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\cdot 2(\Theta_0+\Theta_1x_i-y_i)\cdot\frac{d}{d\Theta_1}(\Theta_0+\Theta_1x_i-y_i)$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\cdot 2(\Theta_0+\Theta_1x_i-y_i)\cdot(x_i)$$
$$=\sum_{i=1}^{m}\frac{1}{2m}\cdot 2(\Theta_0+\Theta_1x_i-y_i)x_i$$
$$=\frac{1}{m}\sum_{i=1}^{m}(\Theta_0+\Theta_1x_i-y_i)x_i=\frac{1}{m}\sum_{i=1}^{m}(H_\Theta(x_i)-y_i)x_i$$
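These two gradient formulas are exactly what gradient descent needs. As an illustration, here is a minimal stdlib-only Python sketch with made-up data drawn from the line $y = 1 + 2x$; with these gradients, descent should recover $\Theta \approx (1, 2)$. The learning rate and iteration count are arbitrary choices, not from the course.

```python
# Gradient descent on (1/2m) * sum((Θ0 + Θ1*x - y)^2)
# using the derived partial derivatives.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x
m = len(xs)

t0, t1, lr = 0.0, 0.0, 0.1  # arbitrary start and learning rate
for _ in range(5000):
    errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
    g0 = sum(errs) / m                              # (1/m) Σ (H(x)-y)
    g1 = sum(e * x for e, x in zip(errs, xs)) / m   # (1/m) Σ (H(x)-y)·x
    t0 -= lr * g0
    t1 -= lr * g1

print(round(t0, 3), round(t1, 3))  # → 1.0 2.0
```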


I think it's a bit inelegant to compute partial derivatives directly. The cleanest way to do this calculation, in my opinion, is to write the objective function as $$ L(\Theta) = \frac{1}{2m} \| X \Theta - Y \|^2 $$ where $$ X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, \qquad \Theta = \begin{bmatrix} \Theta_0 \\ \Theta_1 \end{bmatrix}. $$ Notice that $L(\Theta) = g(h(\Theta))$, where $$h(\Theta) = X \Theta - Y, \qquad g(u) = \frac{1}{2m} \| u \|^2. $$ The derivatives of $h$ and $g$ are $$ h'(\Theta) = X, \qquad g'(u) = \frac{1}{m}u^T. $$ By the multivariable chain rule, \begin{align} L'(\Theta) &= g'(h(\Theta)) h'(\Theta) \\ &= \frac{1}{m}(X \Theta - Y)^T X. \end{align} It follows that $$ \tag{1} \nabla L(\Theta) = L'(\Theta)^T = \frac{1}{m}X^T ( X \Theta - Y). $$
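The vectorized expression $(1)$ can be checked against the component-wise sums from the other answers. A small stdlib-only Python sketch (data values made up):

```python
# Verify (1/m) X^T (XΘ - Y) matches the per-parameter formulas
# g0 = (1/m) Σ r_i  and  g1 = (1/m) Σ r_i x_i, where r_i is the residual.

xs = [1.0, 2.0, 4.0]        # made-up data
ys = [2.0, 3.0, 6.0]
theta = [0.3, 0.9]          # [Θ0, Θ1], arbitrary
m = len(xs)

X = [[1.0, x] for x in xs]  # design matrix with a leading column of ones
resid = [sum(Xi[j] * theta[j] for j in range(2)) - y
         for Xi, y in zip(X, ys)]                       # XΘ - Y

# ∇L = (1/m) X^T (XΘ - Y)
grad_vec = [sum(X[i][j] * resid[i] for i in range(m)) / m for j in range(2)]

# Component-wise formulas from the earlier derivations
g0 = sum(resid) / m
g1 = sum(r * x for r, x in zip(resid, xs)) / m

print(abs(grad_vec[0] - g0) < 1e-12, abs(grad_vec[1] - g1) < 1e-12)  # → True True
```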


Here's another approach, which has the virtue that it is quite similar to the gradient calculation required for logistic regression. Let $\hat x_i = \begin{bmatrix} 1 \\ x_i \end{bmatrix}$ denote the $i$th row of $X$, written as a column vector, and let $h_i:\mathbb R \to \mathbb R$ be the function defined by $$ h_i(u) = \frac12 (u - y_i)^2. $$ So $h_i'(u) = u - y_i$. Notice that $$ L(\Theta) = \frac{1}{m} \sum_{i=1}^m h_i(\hat x_i^T \Theta). $$ By the multivariable chain rule, the derivative of $L$ is \begin{align} L'(\Theta) &= \frac{1}{m} \sum_{i=1}^m h_i'(\hat x_i^T \Theta) \hat x_i^T \\ &= \frac{1}{m} \sum_{i=1}^m (\hat x_i^T \Theta - y_i) \hat x_i^T. \end{align} Thus, $$ \nabla L(\Theta) = L'(\Theta)^T = \frac{1}{m} \sum_{i=1}^m \hat x_i(\hat x_i^T \Theta - y_i). $$ This is equivalent to the expression (1) above.