Linear regression model without the error term


I have this linear model from a regression:

$Y_i = \beta_1 X_{i1} + \dots + \beta_m X_{im} + \epsilon_i$

The matrix representation is:

$Y = X\beta + \epsilon$

In a lot of places, like Wikipedia, they say that $Y = X\beta$ is an overdetermined system (in fact it is) and then they apply least squares.

My question is: why are they trying to solve $Y = X\beta$? The original system was $Y = X\beta + \epsilon$. Why do they ignore the error term $\epsilon$?

My guess is that $Y$ is the real value and they are not trying to solve $Y = X\beta$ but $\hat{Y} = X\beta$, where $\hat{Y} = Y - \epsilon$ is the observed value, but I couldn't find this in any book or trustworthy source, so maybe I'm wrong.

Thanks.


There are 2 answers below.

BEST ANSWER

Let's suppose we run some experiment with $m$ experimental conditions $n$ times. $Y_i$ is the outcome of the $i$th experiment and $X_{i1},\dots,X_{im}$ is the list of experimental conditions of the $i$th experiment. Let's write $X_i = (X_{i1}, \dots, X_{im})$. Then the data we observe is $(Y_i, X_i),\,i=1,\dots,n$. Note that we do observe the true experimental outcome and the true experimental conditions.

Given our data we can ask: How well can our experimental outcome be described as a linear function of the experimental conditions? We can phrase this question as: How close can we get to solving the following system of $n$ equations? $$Y_i = X_i\tilde \beta, \quad i=1,\dots,n.$$ In matrix notation, the system is $$Y = X \tilde\beta, \tag{1}$$ where $Y=(Y_1,\dots,Y_n)^T$ and $X$ is the matrix whose $i$th row is $X_i$. Note that $(1)$ is exactly the system of equations you are wondering about.

If we can find a solution $\beta$ to $(1)$ then all is good. Usually this will not be the case, however. Instead, we can try to find an approximate solution: a parameter vector that does not exactly solve $(1)$ but gets "close" to solving it. One way of measuring how close some parameter vector $\beta$ comes to solving $(1)$ is to define residuals $$\varepsilon_i = Y_i - X_i\beta.\tag{2}$$ Then by construction $Y_i = X_i\beta + \varepsilon_i$ holds for all $i$. Note that $\beta$ is a solution to $(1)$ if, and only if $\varepsilon_i =0$ holds for all $i$. Intuitively, $\beta$ is close to solving $(1)$ if the $\varepsilon_i$ are "close to zero". One way of measuring this closeness is by the sum of squared residuals $$\varepsilon_1^2 + \dots + \varepsilon_n^2,$$ where $\varepsilon_i$ is defined by $(2)$. The smaller the sum of squared residuals, the closer $\beta$ gets to being a solution to $(1)$. The parameter vector achieving the smallest sum of squared errors is precisely the ordinary least squares estimator $$\hat \beta = (X^TX)^{-1}X^TY.$$
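To make this concrete, here is a minimal numerical sketch in Python/NumPy (the data, sizes, and variable names are made up for illustration, and it assumes $X^TX$ is invertible): it sets up an overdetermined system, computes $\hat \beta = (X^TX)^{-1}X^TY$, and checks that perturbing $\hat\beta$ does not decrease the sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up overdetermined system: n = 50 equations, m = 3 unknowns.
n, m = 50, 3
X = rng.normal(size=(n, m))
Y = rng.normal(size=n)  # arbitrary observed outcomes; no exact solution expected

# Ordinary least squares estimate: solve (X^T X) beta = X^T Y
# rather than inverting X^T X explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

def ssr(beta):
    """Sum of squared residuals for a candidate parameter vector."""
    resid = Y - X @ beta
    return resid @ resid

# Any other parameter vector does at least as badly.
beta_other = beta_hat + rng.normal(scale=0.1, size=m)
print(ssr(beta_hat) <= ssr(beta_other))  # True
```

In practice one would typically call `np.linalg.lstsq(X, Y)`, which solves the same minimization problem in a numerically more stable way than forming $X^TX$.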

Given $\hat \beta$ as an approximate solution to $(1)$ we can define $\hat Y_i = X_i\hat \beta$ and $$\hat \varepsilon_i = Y_i - X_i\hat \beta = Y_i - \hat Y_i.$$ Here, $\hat \varepsilon_i$ measures how close $\hat \beta$ gets to solving the $i$th equation in $(1)$.
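Continuing the sketch above (same hypothetical `X`, `Y`, and `beta_hat`), the fitted values and residuals are then:

```python
Y_hat = X @ beta_hat     # fitted values:  Y_hat_i = X_i beta_hat
eps_hat = Y - Y_hat      # residuals:      eps_hat_i = Y_i - Y_hat_i
print(np.allclose(Y, Y_hat + eps_hat))  # True by construction
```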

There are other ways of motivating ordinary least squares but if you are wondering what role the system $Y = X \tilde\beta$ plays then in my opinion the "approximate solution to a system of equations" approach is the one to think about. One nice aspect of this approach is that it shows that linear regression can be motivated without any reference to randomness.

See here for another explanation of this approach. For a slightly different motivation of linear regression see e.g. pages 44-45 here.

SECOND ANSWER

The least squares solution provides the value of $\overrightarrow\beta$ which minimizes the norm of $\overrightarrow\varepsilon$ on the given data. So this approach does take $\overrightarrow\varepsilon$ into account, and it is the optimal one if the errors $\varepsilon_i$ are independent random variables, each following a normal distribution with zero mean and standard deviation $\sigma$, $$\varphi_i(\varepsilon_i) = \dfrac1{\sigma\sqrt{2\pi}}\,e^{-\frac{\varepsilon_i^2}{2\sigma^2}}.$$

Under these conditions, the density function of the observed errors equals the product of the individual densities,

$$\varphi(\overrightarrow\varepsilon) = \dfrac1{\left(\sigma\sqrt{2\pi}\right)^k}\,e^{-\frac{\sum\varepsilon_i^2}{2\sigma^2}},\tag1$$ where $k$ is the number of observations and $\overrightarrow \varepsilon = \overrightarrow {Y - X\beta}$. The greatest value of this function corresponds to the least sum of squared errors.
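Taking logarithms makes this equivalence explicit; a short derivation in the notation above:

$$\log \varphi(\overrightarrow\varepsilon) = -k\log\left(\sigma\sqrt{2\pi}\right) - \frac{\sum\varepsilon_i^2}{2\sigma^2},$$

so for a fixed $\sigma$, maximizing the likelihood $(1)$ over $\overrightarrow\beta$ is the same as minimizing $\sum\varepsilon_i^2 = \|Y - X\beta\|^2$, which is exactly the least squares criterion.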

Also, in the spectral domain the errors can be modeled as "white noise" with a constant power spectral density.

Besides, equation $(1)$ can be transformed into a posterior distribution law (Fisher's approach).

The task of identifying the distribution law is very hard, and I have not seen it carried out for AR models at all.