Why divide by $2m$


I'm taking a machine learning course. The professor has a model for linear regression, where $h_\theta$ is the hypothesis (the proposed model; linear regression, in this case), $J(\theta_1)$ is the cost function, $m$ is the number of elements in the training set, and $x^{(i)}$ and $y^{(i)}$ are the variables of the $i$-th training example:

$$h_\theta(x) = \theta_1x$$

$$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$$

What I don't understand is why he is dividing the sum by $2m$.
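To make the formula concrete, here is a minimal sketch of the cost function in Python; the dataset and the value of $\theta_1$ are made up for illustration:

```python
# Toy training set (made-up values for illustration).
xs = [1.0, 2.0, 3.0]   # x^{(i)}
ys = [1.5, 3.5, 5.0]   # y^{(i)}
m = len(xs)

def h(theta1, x):
    """Hypothesis: h_theta(x) = theta1 * x."""
    return theta1 * x

def J(theta1):
    """Cost: (1 / 2m) * sum of squared errors over the training set."""
    return sum((h(theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1.7))  # squared errors 0.04 + 0.01 + 0.01 = 0.06, so J = 0.06 / 6 = 0.01
```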

There are 3 answers below.

Best answer:

The $\frac{1}{m}$ is to "average" the squared error over the number of components so that the number of components doesn't affect the function (see John's answer).

So now the question is why there is an extra $\frac{1}{2}$. In short, it doesn't matter. The solution that minimizes $J$ as you have written it will also minimize $2J=\frac{1}{m} \sum_i (h(x_i)-y_i)^2$. The latter function, $2J$, may seem more "natural," but the factor of $2$ does not matter when optimizing.

The only reason some authors like to include it is that when you take the derivative with respect to $\theta_1$, the $2$ cancels.
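Concretely, with $h_\theta(x^{(i)}) = \theta_1 x^{(i)}$, the chain rule gives

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{2m} \sum_{i=1}^{m} 2\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}$$

so the factor of $2$ from differentiating the square cancels the $\frac{1}{2}$, leaving a clean gradient.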

Another answer:

Dividing by $2m$ ensures that the cost function doesn't depend on the number of elements in the training set. This allows a better comparison across models.
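A quick sketch of this point, using a made-up dataset: duplicating every training example doubles $m$ and doubles the sum of squared errors, so the averaged cost stays the same.

```python
def J(theta1, xs, ys):
    """Cost: (1 / 2m) * sum of squared errors."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data (illustrative values only).
xs, ys = [1.0, 2.0, 3.0], [1.5, 3.5, 5.0]

small = J(1.7, xs, ys)            # original training set
large = J(1.7, xs * 2, ys * 2)    # same data, duplicated (m doubled)

print(small, large)  # both are approximately 0.01
```

Without the $\frac{1}{m}$, the second cost would be twice the first, making costs from differently sized training sets incomparable.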

Another answer:

This is how they explain it at Coursera (https://www.coursera.org/learn/machine-learning/supplement/nhzyF/cost-function):

The mean is halved ($\frac{1}{2}$) as a convenience for the computation of gradient descent, as the derivative of the square term will cancel out the $\frac{1}{2}$.