Learn Noise / Error in Least Squares If We Know Its Form?


Let's say I am doing linear regression and I have a data matrix $A$, and I know the noise $e_i$ is zero-mean (perhaps we even know its distribution):

$$y_i = a_i^T x + e_i$$

Obviously from the data, I want to learn the parameters, $x$:

$$y=Ax$$

but I'm wondering: is there any reason I cannot learn the noise, too, if I know it has a specific form like this?

In other words, I would create a matrix like this:

$$ \hat A=\begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} & 1 & 0 & \dots & 0 \\ a_{21} & a_{22} & \dots & a_{2n} & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} & 0 & 0 & \dots & 1 \\ 0 & 0 & \dots & 0 & 1 & 1 & \dots & 1 \\ \end{bmatrix} $$ and a solution vector like this

$$ \hat x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \\ w_{1} \\ w_{2} \\ \vdots \\ w_{m} \\ \end{bmatrix} $$ and a known vector like this: $$ \hat y = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \\ 0 \\ \end{bmatrix} $$

The last row is there to ensure that the errors sum to zero.
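The augmented system above can be sketched in NumPy; the dimensions ($m = 5$, $n = 2$) are illustrative choices, not from the question:

```python
import numpy as np

# Hypothetical small instance: m = 5 observations, n = 2 parameters.
rng = np.random.default_rng(0)
m, n = 5, 2
A = rng.standard_normal((m, n))

# Augmented matrix: [A | I_m] stacked on a final row [0 ... 0 | 1 ... 1]
# that encodes the constraint w_1 + ... + w_m = 0.
A_hat = np.block([
    [A,                np.eye(m)],
    [np.zeros((1, n)), np.ones((1, m))],
])

print(A_hat.shape)  # (m + 1, n + m) = (6, 7): wider than it is tall
```

Note that $\hat A$ has $m + 1$ rows but $n + m$ columns, which already hints at the trouble discussed in the answers below.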

Will I get greater accuracy if I fit the regression this way? Thanks.


2 Answers


The noise has zero mean, but that does not mean any particular realization of it sums to zero (which is what your constraint imposes).
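A quick numerical illustration of this point (sample size and seed are arbitrary choices): draw zero-mean Gaussian noise and check its sample sum.

```python
import numpy as np

rng = np.random.default_rng(42)
e = rng.standard_normal(10)  # zero-MEAN noise, sigma = 1

# The sum of a finite sample is almost surely nonzero, even though
# the expected value of the sum is zero.
print(e.sum())
```

So the constraint $\sum_i w_i = 0$ does not match any property the actual error vector has.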

In fact, if you formulate the problem as a maximum likelihood problem, you will see that the resulting estimator already accounts for the noise properties.

---

In your formulation you have $n + m$ unknowns and only $m + 1$ equations, so the problem is underdetermined. There will be an entire affine subspace of solutions to the new least squares problem, and different elements of this solution space will yield wildly different predictions when exposed to new testing data.

The approach could fail in a particularly striking way if you create a problem instance where $\sum_i y_i = 0$. In this case, you could have a solution with $x_i = 0$ (for $i = 1,\ldots, n$) and $w_i = y_i$ (for $i = 1,\ldots,m$).
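The degenerate case described above is easy to exhibit numerically. Below is a sketch (dimensions chosen for illustration) that centers $y$ so that $\sum_i y_i = 0$, and then verifies that the "all noise, no signal" solution $x = 0$, $w = y$ satisfies the augmented system exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 2
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
y -= y.mean()  # force sum(y) = 0 to build the degenerate instance

# Augmented system from the question: [A | I_m] plus the sum-to-zero row.
A_hat = np.block([
    [A, np.eye(m)],
    [np.zeros((1, n)), np.ones((1, m))],
])
y_hat = np.concatenate([y, [0.0]])

# The "all noise, no signal" candidate: x = 0, w = y.
x_hat = np.concatenate([np.zeros(n), y])
residual = A_hat @ x_hat - y_hat
print(np.max(np.abs(residual)))  # ~0: solves the system exactly
```

This solution achieves zero residual yet has learned nothing about $x$, which is exactly the failure mode of an underdetermined formulation.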


By the way, if you want to use your knowledge of the form of the noise, a natural way to do that is maximum likelihood estimation. In fact, this leads to the standard least squares formulation for linear regression, so standard least squares is already using our knowledge of the noise distribution. Here are the details:

We assume that the random variables $Y_i$ are independent and that the probability density function for $Y_i$ is $$ p_i(y_i) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i - a_i^T x)^2}{2\sigma^2}}. $$ The probability density of observing the given values $y_1,\ldots, y_m$ for the random variables $Y_1,\ldots,Y_m$ is $$ L(x) = \prod_{i=1}^m p_i(y_i). $$ We would like to select $x$ to maximize $L(x)$, but it simplifies the math to maximize $\log(L(x))$ instead: $$ \log(L(x)) = \sum_{i=1}^m -\frac{(y_i - a_i^T x)^2}{2\sigma^2} + \text{terms that do not depend on $x$}. $$ So maximizing $L(x)$ is equivalent to minimizing $\sum_{i=1}^m (y_i - a_i^T x)^2$. But this is just our standard least squares problem.
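The derivation above can be checked numerically: the least squares solution should make the Gaussian negative log-likelihood no larger than at any other point. A minimal sketch (all dimensions and the noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, sigma = 20, 3, 0.5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
y = A @ x_true + sigma * rng.standard_normal(m)

def neg_log_likelihood(x):
    # -log L(x) under i.i.d. Gaussian noise with known std sigma:
    # constant term + ||y - Ax||^2 / (2 sigma^2)
    r = y - A @ x
    return 0.5 * m * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

# Ordinary least squares minimizer of ||y - Ax||^2.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Since -log L(x) differs from ||y - Ax||^2 only by a positive scale
# and an additive constant, x_ls must also minimize -log L(x).
for _ in range(5):
    x_other = x_ls + 0.1 * rng.standard_normal(n)
    assert neg_log_likelihood(x_ls) <= neg_log_likelihood(x_other)
print("least squares solution minimizes the negative log-likelihood")
```

The same argument explains why, for non-Gaussian noise, maximum likelihood leads to a different loss than the squared error.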