Why is the least-squares cost function for linear regression convex?


I was looking at Andrew Ng's machine learning course. For linear regression he defines a hypothesis function $h(x) = \theta_0 + \theta_1x_1 + \dots + \theta_nx_n$, where $x$ is a vector of feature values, and the goal of linear regression is to find the $\theta$ whose predictions most closely match the real results. To measure how wrong the hypothesis is on a single example, compared to how the data is actually distributed, he uses the squared error $$ \mathrm{error} = (h(x) - y)^2, $$ where $y$ is the real result. Since there are a total of $m$ training examples, he aggregates the errors so that all of them are accounted for, defining the cost function $$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h(x_i) - y_i)^2, $$ where $(x_i, y_i)$ is a single training example.

He states that $J(\theta)$ is convex, with only one local optimum. I want to know why this function is convex.

There are 2 answers below.

BEST ANSWER

Let $x_i \in \mathbb R^n$ be the $i$th training example and let $X$ be the matrix whose $i$th row is $x_i^T$. Let $y$ be the column vector whose $i$th entry is $y_i$. Define $J:\mathbb R^n \to \mathbb R$ by $$ J(\theta) = \frac{1}{2m} \sum_{i=1}^m (x_i^T \theta - y_i)^2. $$ Notice that $$ J(\theta) = \frac{1}{2m} \| X \theta - y \|_2^2. $$ You can easily check that the function $$ f(u) = \frac{1}{2m} \| u \|_2^2 $$ is convex by checking that its Hessian is positive definite. (In fact, $$ \nabla^2 f(u) = \frac{1}{m} I, $$ where $I$ is the identity matrix.)

A very useful fact to be aware of is that the composition of a convex function with an affine function is convex. Noting that $J(\theta) = f(X \theta - y)$ is in fact the composition of the convex function $f$ with the affine function $\theta \mapsto X \theta - y$, we can invoke this useful fact to conclude that $J$ is convex.
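For completeness, that fact can be verified in one line here, using only the convexity of $f$ and the linearity of $\theta \mapsto X\theta$: for any $\theta_1, \theta_2 \in \mathbb R^n$ and $0 \le \lambda \le 1$, $$ J(\lambda \theta_1 + (1-\lambda)\theta_2) = f\bigl(\lambda (X\theta_1 - y) + (1-\lambda)(X\theta_2 - y)\bigr) \le \lambda f(X\theta_1 - y) + (1-\lambda) f(X\theta_2 - y) = \lambda J(\theta_1) + (1-\lambda) J(\theta_2), $$ where the first equality uses $\lambda y + (1-\lambda) y = y$ and the inequality is the convexity of $f$.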


An alternative approach is to compute the Hessian of $J$ directly: $$ \nabla J(\theta) = \frac{1}{m} X^T(X\theta - y) $$ and $$\nabla^2 J(\theta) = \frac{1}{m} X^T X. $$ The matrix $X^T X$ is positive semidefinite, which shows that $J$ is convex.
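To spell out the step this leaves implicit: $X^T X$ is positive semidefinite because for any $v \in \mathbb R^n$, $$ v^T X^T X v = (Xv)^T (Xv) = \| Xv \|_2^2 \ge 0, $$ and a twice-differentiable function whose Hessian is positive semidefinite everywhere is convex.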

SECOND ANSWER

Define

$$ f(\theta) = \| h(\theta,x)-y \|^2 $$

with $h(\theta,x) = \langle \theta, x \rangle$.

To show convexity it suffices to prove

$$ f(\lambda \theta_1+(1-\lambda)\theta_2) \le \lambda f(\theta_1)+(1-\lambda)f(\theta_2) $$

for all $0 \le \lambda \le 1$.

The algebra is laborious, but expanding both sides shows the inequality holds, and the convexity of the cost function follows.
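To make the laborious step explicit for a single training example $(x, y)$ (using shorthand not in the original answer): write $a = \langle \theta_1, x \rangle - y$ and $b = \langle \theta_2, x \rangle - y$. By linearity of the inner product, $h(\lambda\theta_1 + (1-\lambda)\theta_2, x) - y = \lambda a + (1-\lambda) b$, so $$ \lambda f(\theta_1) + (1-\lambda) f(\theta_2) - f(\lambda \theta_1 + (1-\lambda)\theta_2) = \lambda a^2 + (1-\lambda) b^2 - \bigl(\lambda a + (1-\lambda) b\bigr)^2 = \lambda(1-\lambda)(a - b)^2 \ge 0, $$ which is exactly the required inequality. Summing over the $m$ training examples preserves it, since a sum of convex functions is convex.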