Why divide by $2m$


I'm taking a machine learning course. The professor has a model for linear regression, where $h_\theta$ is the hypothesis (the proposed model; linear regression, in this case), $J(\theta_1)$ is the cost function, $m$ is the number of elements in the training set, and $x^{(i)}$ and $y^{(i)}$ are the variables of the $i$-th training example:

$$h_\theta(x) = \theta_1x$$

$$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$$

What I don't understand is why he is dividing the sum by $2m$.
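To make the formula concrete, here is a minimal sketch of the cost function in Python; the dataset and the value of $\theta_1$ are made up for illustration:

```python
# Toy training set (made-up values for illustration).
xs = [1.0, 2.0, 3.0]   # x^{(i)}
ys = [1.5, 3.5, 5.0]   # y^{(i)}
m = len(xs)

def h(theta1, x):
    """Hypothesis: h_theta(x) = theta1 * x."""
    return theta1 * x

def J(theta1):
    """Cost: (1 / 2m) * sum of squared errors over the training set."""
    return sum((h(theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1.7))  # squared errors 0.04 + 0.01 + 0.01 = 0.06, so J = 0.06 / 6 = 0.01
```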

There are 3 answers below.

Best answer:

The $\frac{1}{m}$ is to "average" the squared error over the number of components so that the number of components doesn't affect the function (see John's answer).

So now the question is why there is an extra $\frac{1}{2}$. In short, it doesn't matter. The solution that minimizes $J$ as you have written it will also minimize $2J=\frac{1}{m} \sum_i (h(x_i)-y_i)^2$. The latter function, $2J$, may seem more "natural," but the factor of $2$ does not matter when optimizing.

The only reason some authors like to include it is that when you take the derivative with respect to $\theta_1$, the $2$ cancels.
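Concretely, with $h_\theta(x^{(i)}) = \theta_1 x^{(i)}$, the chain rule gives

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{2m} \sum_{i=1}^{m} 2\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}$$

so the factor of $2$ from differentiating the square cancels the $\frac{1}{2}$, leaving a clean gradient.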

Another answer:

Dividing by $2m$ ensures that the cost function doesn't depend on the number of elements in the training set. This allows a better comparison across models.
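A quick sketch of this point, using a made-up dataset: duplicating every training example doubles $m$ and doubles the sum of squared errors, so the averaged cost stays the same.

```python
def J(theta1, xs, ys):
    """Cost: (1 / 2m) * sum of squared errors."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data (illustrative values only).
xs, ys = [1.0, 2.0, 3.0], [1.5, 3.5, 5.0]

small = J(1.7, xs, ys)            # original training set
large = J(1.7, xs * 2, ys * 2)    # same data, duplicated (m doubled)

print(small, large)  # both are approximately 0.01
```

Without the $\frac{1}{m}$, the second cost would be twice the first, making costs from differently sized training sets incomparable.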

Another answer:

This is how they explain it at Coursera (https://www.coursera.org/learn/machine-learning/supplement/nhzyF/cost-function):

The mean is halved ($\frac{1}{2}$) as a convenience for the computation of gradient descent, as the derivative of the square term will cancel out the $\frac{1}{2}$.