In the book Hands-On Machine Learning with Scikit-Learn & TensorFlow, the author only shows the partial-derivative formula used by the Batch Gradient Descent method:
$ \dfrac{\partial}{\partial \theta_{j}} MSE(\theta)= \dfrac{2}{m}\sum_{i=1}^{m}(\theta^T \cdot \boldsymbol{x}^{(i)}-y^{(i)})\cdot x_j^{(i)}$
So that the gradient vector of the cost function is: $\nabla_{\theta}MSE(\theta) = \begin{bmatrix} \dfrac{\partial}{\partial \theta_{0}} MSE(\theta) \\ \dfrac{\partial}{\partial \theta_{1}} MSE(\theta) \\ \dfrac{\partial}{\partial \theta_{2}} MSE(\theta) \\ \vdots \\ \dfrac{\partial}{\partial \theta_{n}} MSE(\theta) \end{bmatrix} = \dfrac{2}{m} \cdot X^T \cdot (X \cdot \theta - y)$
The MSE cost function is defined as: $MSE(\theta) = \dfrac{1}{m}\sum_{i=1}^{m}(\theta^T \cdot \boldsymbol{x}^{(i)}-y^{(i)})^2$
Could anyone kindly show me, step by step, how to prove the cost function's gradient vector formula above (using linear algebra)?
The cost function (writing $\boldsymbol{w}$ for $\theta$ and $N$ for $m$) is given by
$$J = \dfrac{1}{N}\sum_{n=1}^{N}\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]^2.$$
Take the total derivative
$$dJ = \dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]d\boldsymbol{w}^T\boldsymbol{x}_n \}.$$
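To spell out this step: $y_n$ is a constant, so the chain rule gives $d\left(\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]^2\right)=2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]d\left(\boldsymbol{w}^T\boldsymbol{x}_n\right)$, and $d\left(\boldsymbol{w}^T\boldsymbol{x}_n\right)=d\boldsymbol{w}^T\boldsymbol{x}_n$ because $\boldsymbol{x}_n$ is held fixed.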
As $d\boldsymbol{w}^T$ does not depend on the summation index $n$, we can pull it out of the sum. We can put it in front of $\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]$ because that term is a scalar. Hence we obtain
$$dJ = d\boldsymbol{w}^T\left[\dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]\boldsymbol{x}_n \}\right].$$
Now, comparing with the relationship between the total derivative and the gradient (explained below), we can identify the term in brackets as the gradient of $J$ with respect to $\boldsymbol{w}$. Hence,
$$\text{grad}_{\boldsymbol{w}}J=\dfrac{1}{N}\sum_{n=1}^N\{2\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]\boldsymbol{x}_n \}.$$
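To recover the matrix form asked about in the question, stack the row vectors $\boldsymbol{x}_n^T$ as the rows of the design matrix $X$ and the targets $y_n$ into the vector $\boldsymbol{y}$, so that the $n$-th component of $X\boldsymbol{w}-\boldsymbol{y}$ is exactly the scalar $\boldsymbol{w}^T\boldsymbol{x}_n-y_n$. The sum above is then a matrix product:
$$\text{grad}_{\boldsymbol{w}}J=\dfrac{2}{N}\sum_{n=1}^N \boldsymbol{x}_n\left[\boldsymbol{w}^T\boldsymbol{x}_n-y_n \right]=\dfrac{2}{N}X^T\left(X\boldsymbol{w}-\boldsymbol{y}\right),$$
which is the book's $\nabla_{\theta}MSE(\theta)=\dfrac{2}{m}X^T(X\theta-y)$ with $\boldsymbol{w}=\theta$ and $N=m$.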
Here is the explanation of the relationship between the gradient and the total derivative.
Let $J(\boldsymbol{w})=J(w_0,w_1,\ldots,w_m)$ be a multivariate function. The total derivative of $J$ is given by
$$dJ = \dfrac{\partial J}{\partial w_0}dw_0+\dfrac{\partial J}{\partial w_1}dw_1+\ldots+\dfrac{\partial J}{\partial w_m}dw_m$$ $$=\left[dw_0, dw_1,\ldots, dw_m\right]\left[\dfrac{\partial J}{\partial w_0},\dfrac{\partial J}{\partial w_1},\ldots,\dfrac{\partial J}{\partial w_m}\right]^T$$ $$=d\boldsymbol{w}^T\,\text{grad}_{\boldsymbol{w}}J.$$
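As a sanity check (not from the book), here is a minimal NumPy sketch that compares the closed-form gradient $\frac{2}{m}X^T(X\theta-y)$ against a central finite-difference approximation of the MSE; the function names (`mse`, `mse_gradient`, `numerical_gradient`) are my own.

```python
import numpy as np

def mse(theta, X, y):
    """MSE(theta) = (1/m) * sum_i (theta^T x_i - y_i)^2."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / m

def mse_gradient(theta, X, y):
    """Closed-form gradient: (2/m) * X^T (X theta - y)."""
    m = len(y)
    return 2.0 / m * X.T @ (X @ theta - y)

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        grad[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
m, n = 50, 4
X = np.c_[np.ones(m), rng.normal(size=(m, n - 1))]  # bias feature x_0 = 1
y = rng.normal(size=m)
theta = rng.normal(size=n)

analytic = mse_gradient(theta, X, y)
numeric = numerical_gradient(lambda t: mse(t, X, y), theta)
print(np.max(np.abs(analytic - numeric)))  # should be on the order of 1e-8
```

The largest componentwise difference between the two gradients should be negligible, which confirms the formula derived above.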