Identical observations in linear regression


I want to do a linear regression $Y = X\beta + e$, but some of the observations (rows in $X$) are identical (about 30 000 out of 50 000 rows remain after deleting all duplicates), so when I try to compute the OLS estimate $\hat{\beta} = (X^t X)^{-1} X^t Y$, the matrix $X^t X$ is singular and cannot be inverted. How do I remedy this?

Best Answer

We have $X\in \mathbb{R}^{m\times n}$ and more equations than variables, thus $m > n$. We want to know when

$$ X^t X \in \mathbb{R}^{n \times n} $$ is invertible. One criterion is that $\ker X^t X = \{ 0 \}$. This means $$ 0 = X^t X u = \sum_j (X^t X)_{.j} u_j $$ only holds for $u = 0$; in other words, the columns of $X^t X$ are linearly independent.

$$ 0 = X^t X u \Rightarrow \\ 0 = u^t X^t X u = (X u)^t (X u) = \lVert X u \rVert^2 \Rightarrow \\ X u = 0. $$ So $X$ must be such that $u = 0$ follows from $X u = 0$. By the same argument as for $X^t X$, this is the case exactly when the columns of $X$ are linearly independent, i.e. when $\mbox{rank } X = n$.
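The equivalence above ($X u = 0 \Leftrightarrow X^t X u = 0$, hence $\mbox{rank } X^t X = \mbox{rank } X$) can be checked numerically; here is a small sketch with NumPy, using a made-up matrix whose third column is a linear combination of the first two:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))            # two independent columns
X = np.hstack([A, A[:, [0]] + A[:, [1]]])  # third column = col0 + col1, so rank X = 2 < n = 3

# rank X^t X equals rank X, so X^t X (3x3) is singular here
print(np.linalg.matrix_rank(X))        # 2
print(np.linalg.matrix_rank(X.T @ X))  # 2

# A kernel vector of X, here u = (1, 1, -1), is also a kernel vector of X^t X
u = np.array([1.0, 1.0, -1.0])
print(np.allclose(X @ u, 0), np.allclose(X.T @ X @ u, 0))  # True True
```

Duplicating any rows of this $X$ would change neither rank nor kernel, which is the point of the remarks below.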

This means:

  • If $X$ has linearly independent columns, adding or removing copies of existing rows gives a new matrix $X$ with the same property. The only benefit of this clean-up is saving memory and speeding up the computation; it has no impact on whether $X^t X$ is singular.

  • If $X^t X$ is not invertible, then the column rank $r$ of $X$, which equals the row rank of $X$, must be smaller than $n$. This can only change if new rows are added that raise the rank to $\mbox{rank } X = n$, which requires at least $n - r$ of them. So the problem is not that you have too much information, but that you lack information.

  • If it is not possible to add such rows, another option is to drop unknowns, reducing $n$ towards $r$, provided the right ones are dropped. Gaussian elimination on $X$ or $X^t$ reveals which columns are dependent.

  • Here are some remarks about the computation of the rank: computation
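To illustrate the first and third bullet points, here is a sketch (with invented data) showing that duplicated rows leave the rank, and hence the invertibility of $X^t X$, unchanged, and using a column-pivoted QR decomposition, one standard rank-revealing tool, to find an independent subset of columns:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))   # full column rank almost surely
X_dup = np.vstack([X, X, X[:2]])  # append copies of existing rows

# Duplicating rows does not change the rank
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_dup))  # 3 3

# Column-pivoted QR: the first r pivot indices select linearly
# independent columns; |diag(R)| below a tolerance signals dependence.
_, R, piv = qr(X_dup, pivoting=True)
r = int(np.sum(np.abs(np.diag(R)) > 1e-10))
print(r, piv[:r])  # numerical rank and indices of independent columns
```

In practice `np.linalg.matrix_rank` (SVD-based) is the robust way to compute the rank of a large design matrix; exact Gaussian elimination is mainly useful for seeing the structure by hand.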