Contour skewing in linear regression cost function for two features


In a Machine Learning tutorial, I came across the following equation for the cost function of linear regression.

$$T(\theta)=\frac{1}{2m}\sum_{i=1}^m\bigl(\theta_0+\theta_1x_1^i+\theta_2x_2^i+\dots+\theta_nx_n^i - y^i\bigr)^2 $$

Here $m$ is the number of training examples and $n$ is the number of features. Take $n=2$, ignore $\theta_0$, and plot $T$ against $\theta_1$ and $\theta_2$; we get a set of contours.

If the range of $x_1$ is 0 to 1000 and the range of $x_2$ is 0 to 1, the tutorial says that the contours skew towards the $\theta_2$ axis. Could someone please show how one can reach that conclusion mathematically?

Answer:

In matrix notation we have
$$ T(\theta) = \frac{1}{2m}\lVert A \theta - y\rVert_2^2 $$
where
$$ A = \left( \begin{array}{cc} 1_m & X \\ \end{array} \right) \in \mathbb{R}^{m\times(1+n)} $$
For $n=2$ we can draw the level curves
$$ C = T(\theta) = \frac{1}{2m} \lVert \theta_0 1_m + \theta_1 x_1 + \theta_2 x_2 - y \rVert_2^2 $$

Rough estimate: as the components of $x_1$ are up to $1000$ times larger than those of $x_2$, $\theta_1$ needs to be up to $1000$ times smaller than $\theta_2$ to make a contribution to $T$ comparable to that of $\theta_2$. Plotted with equal scales on both axes, the contours will therefore be slim along the $\theta_1$-axis and elongated along the $\theta_2$-axis.
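This rough estimate can be checked numerically. A minimal NumPy sketch (the feature ranges match the question; the target $y$ is made up purely for illustration): the curvature of $T$ along each axis is given by the diagonal of the Hessian $\frac{1}{m}A^TA$, and the ratio of these curvatures shows how much "steeper" the $\theta_1$-direction is.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x1 = rng.uniform(0, 1000, m)   # feature with range 0..1000
x2 = rng.uniform(0, 1, m)      # feature with range 0..1
# Arbitrary target, only needed so that T is well defined.
y = 0.003 * x1 + 2.0 * x2 + rng.normal(0, 0.1, m)

def T(t1, t2):
    """Cost with theta_0 fixed at 0, as in the question."""
    r = t1 * x1 + t2 * x2 - y
    return r @ r / (2 * m)

# Curvature of T along each axis: diagonal entries of the Hessian (1/m) A^T A.
h11 = x1 @ x1 / m   # ~ mean(x1^2), of order 1000^2 / 3
h22 = x2 @ x2 / m   # ~ mean(x2^2), of order 1 / 3
print(h11 / h22)    # ratio on the order of 1e6
```

A contour of $T$ shrinks along a direction in proportion to the square root of the curvature there, so a curvature ratio of about $10^6$ means the contour is about $1000$ times narrower along $\theta_1$ than along $\theta_2$.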

Alternative estimate: \begin{align} \partial T/\partial \theta_i &= \frac{1}{m} \left(\theta_0 1_m + \theta_1 x_1 + \theta_2 x_2 - y\right)^T x_i \quad (i \in \{1,2\}) \\ &= \frac{1}{m} (A\theta-y)^T x_i \\ &= \frac{1}{m} \lVert A\theta-y\rVert_2 \,\lVert x_i \rVert_2 \cos \alpha_i \\ \end{align} where $\alpha_i$ is the angle between the residual $A\theta-y$ and $x_i$. So the $i$-th component of the gradient of $T$ is proportional to the length of $x_i$, which is about $\sqrt{m}\, \lvert \bar{x}_i\rvert$. This means the $1$-component of the gradient is up to $1000$ times larger than the $2$-component. The gradient is orthogonal to the level curves. If both axes are scaled the same, the level curves are orthogonal to a gradient that points mostly in the $1$-direction, so they extend mostly in the $2$-direction.