Parameter estimation in a linear model - why does the standard deviation of a parameter estimate increase as the X matrix gets wider?

Intro

Let $Y = X\beta + \epsilon$, where $X$ is an $n \times m$ matrix of randomly generated data drawn from a normal distribution and $\epsilon$ is a vector of normal random errors. Say the first 5 elements of $\beta$ are non-zero and all others are 0, for example $\beta = (3,3,3,3,3,0,\dots,0)$. From now on we forget the model parameters, keep only $X$ and $Y$, and try to estimate $\beta$ from the observed data. The ordinary least squares method gives: $$\hat\beta = (X^TX)^{-1}X^TY$$
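As a sanity check, this setup can be simulated directly; a minimal NumPy sketch, using a smaller $m$ than in the question below and an assumed noise level $\sigma = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 50                    # smaller m than the 950 used later, for speed
X = rng.standard_normal((n, m))    # i.i.d. normal design matrix
beta = np.zeros(m)
beta[:5] = 3.0                     # beta = (3, 3, 3, 3, 3, 0, ..., 0)
eps = rng.standard_normal(n)       # normal errors, sigma = 1
Y = X @ beta + eps

# OLS: beta_hat = (X^T X)^{-1} X^T Y; lstsq is the numerically stable route
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

The recovered `beta_hat` should be close to the true $\beta$: roughly 3 in the first five entries and roughly 0 elsewhere.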

As part of a lab assignment, we are asked to fit linear models using different numbers $p$ of leading columns of $X$ and see how some statistics of the estimate of $\beta_1$ change. For example, since $\hat\beta_1$ is a random variable, we can derive its standard deviation. An estimator of $\sigma(\hat\beta_1)$ is given by $$s(\hat\beta_1) = \sqrt{s^2(X^TX)^{-1}_{1,1}},$$ where $s^2 = \frac{1}{n-p}\sum_{i=1}^n(Y_i - \hat Y_i)^2$ for the model using the first $p$ columns of $X$ and coefficients $\beta_0, \dots, \beta_{p-1}$.
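The quantity $s(\hat\beta_1)$ can be computed directly for each $p$; a sketch under a simulated version of the setup ($m = 950$ matches the plot below, and the true $\sigma = 1$ is an assumption of the simulation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 950
X = rng.standard_normal((n, m))
beta = np.zeros(m)
beta[:5] = 3.0
Y = X @ beta + rng.standard_normal(n)

def sd_beta1(p):
    """s(beta_hat_1) for the model using only the first p columns of X."""
    Xp = X[:, :p]
    XtX_inv = np.linalg.inv(Xp.T @ Xp)
    beta_hat = XtX_inv @ Xp.T @ Y
    resid = Y - Xp @ beta_hat
    s2 = resid @ resid / (n - p)       # s^2 with n - p degrees of freedom
    return np.sqrt(s2 * XtX_inv[0, 0])

for p in (5, 100, 500, 900):
    print(p, round(sd_beta1(p), 4))    # grows as p grows
```

Running this reproduces the phenomenon in the question: $s(\hat\beta_1)$ increases with $p$.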

The problem

I have noticed that $s(\hat\beta_1)$ increases as $p$ increases, but I can't explain why. I considered a few potential reasons:

a. $s^2$ increases with $p$. I made a plot of $s^2$ vs $p$, and it indicates the opposite, which I can't explain either. The plot (here the full $X$ is $1000 \times 950$, so the maximum $p$ is 950):

[Plot: $s^2$ versus $p$, decreasing]

b. values on the diagonal of $(X^TX)^{-1}$ are bigger for a wider $X$ matrix. I am going to test this now, but even if it's true, I am looking for an explanation of why, and also for an interpretation of the matrix $(X^TX)^{-1}$.
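Point (b) is easy to check numerically on a simulated Gaussian design: the $(1,1)$ entry of $(X^TX)^{-1}$ does grow as columns are added (for a Gaussian $X$, each diagonal entry of $(X^TX)^{-1}$ has mean $1/(n-p-1)$, so growth with $p$ is expected). A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X = rng.standard_normal((n, 950))

diag_11 = {}
for p in (5, 100, 500, 900):
    Xp = X[:, :p]                             # model with first p columns
    diag_11[p] = np.linalg.inv(Xp.T @ Xp)[0, 0]
    print(p, diag_11[p])   # increases with p; about 1/(n - p - 1) on average
```

For a fixed $X$ with nested column sets, this diagonal entry is in fact monotone non-decreasing in $p$, which is exactly what the block-matrix argument in the answer below proves.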

c. some other reason? I have no idea what it would be.

I will appreciate even partial answers. And please correct me if anything I wrote about the theory is incorrect.

1 Answer


In general, including more covariates leads to higher variance of the coefficient estimates. Studying this rigorously is a bit more involved, but I don't know a simpler way, so this may or may not be what you are looking for.

Let's introduce some notation that will help distinguish the different quantities.

Let's assume that $\beta_j=0$, i.e. that the true coefficient of the $j$-th feature is actually zero, so that all the assumptions of the standard linear model hold whether we include or exclude the $j$-th feature. Let $X$ be the full $n \times p$ design matrix ($n$ observations, $p$ features), let $X_{-j}$ denote the design matrix with the $j$-th column removed (the column containing the $j$-th feature's values), and let $X_j$ denote the $j$-th column of $X$. Similarly, let $\hat{\beta}$ be the least squares estimate on the full dataset and $\hat{\beta}_{-j}$ the least squares estimate when the $j$-th feature is omitted. Finally, let $\hat{\beta}_{k, -j}$ denote the least squares estimate of $\beta_k$ when feature $j$ is removed from the data, with $j \neq k$.

Now, we know that $$ \text{cov}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}, $$ and $$ \text{var}(\hat{\beta}_k) = \sigma^2 [(X^TX)^{-1}]_{kk}, $$ $$ \text{var}(\hat{\beta}_{k, -j}) = \sigma^2 [(X_{-j}^TX_{-j})^{-1}]_{kk}, $$ where the notation $[A]_{kk}$ denotes the $(k,k)$-th element of $A$. Recall that for a block matrix we have

$$ \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} (A-BD^{-1}C)^{-1} & - \\ - & - \end{bmatrix} $$ where I am ignoring the $-$ terms because they are not important for our discussion. Permuting the columns so that the $j$-th comes last (which does not change the diagonal entries we care about), we can decompose the $p \times p$ matrix into:

$$ (X^TX)^{-1} = \begin{bmatrix} X^T_{-j}X_{-j} & X^T_{-j}X_{j} \\ X^T_{j}X_{-j} & X^T_{j}X_{j} \end{bmatrix}^{-1} = \begin{bmatrix} (X_{-j}^T X_{-j} - X_{-j}^T X_{j} (X_{j}^T X_{j})^{-1} X_j^T X_{-j} )^{-1} & - \\ - & - \end{bmatrix} $$

Therefore, we have that \begin{align*} \text{var}(\hat{\beta}_k) &= \sigma^2 [(X^TX)^{-1}]_{kk}\\ &= \sigma^2 [(X_{-j}^T X_{-j} - X_{-j}^T X_{j} (X_{j}^T X_{j})^{-1} X_j^T X_{-j} )^{-1}]_{kk} \end{align*}

Note that in the semi-definite ordering $$ X_{-j}^T X_{-j} - X_{-j}^T X_{j} (X_{j}^T X_{j})^{-1} X_j^T X_{-j} \preccurlyeq X_{-j}^T X_{-j} $$ and so, inverting, $$ (X_{-j}^T X_{-j} - X_{-j}^T X_{j} (X_{j}^T X_{j})^{-1} X_j^T X_{-j})^{-1} \succcurlyeq (X_{-j}^T X_{-j})^{-1}. $$ Since the semi-definite ordering implies the same ordering of diagonal entries, and the variance we are interested in is the $(k,k)$ entry, it follows that $$ \sigma^2 [(X^TX)^{-1}]_{kk} = \text{var}(\hat{\beta}_k) \ge \text{var}(\hat{\beta}_{k,-j}) = \sigma^{2} [(X_{-j}^TX_{-j})^{-1}]_{kk}, $$ with equality holding only if the subtracted term is zero: $$ \underbrace{X_{-j}^T X_{j} (X_{j}^T X_{j})^{-1} X_j^T X_{-j}}_{=0}, $$ which happens exactly when the $j$-th covariate is orthogonal to the rest of the covariates.
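The inequality can be checked numerically; a small sketch comparing the diagonal of the full inverse with the reduced one (the Gaussian design and the choice of $n$, $p$, $j$ here are illustrative assumptions, and indices are 0-based):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, j = 200, 10, 7                    # drop column j (0-based index)
X = rng.standard_normal((n, p))
X_minus = np.delete(X, j, axis=1)       # design with the j-th column removed

full_inv = np.linalg.inv(X.T @ X)
red_inv = np.linalg.inv(X_minus.T @ X_minus)

# For every k != j, [full_inv]_{kk} = var(beta_hat_k) / sigma^2 should
# dominate the reduced-model value [red_inv]_{kk}.
keep = [i for i in range(p) if i != j]
ok = np.all(np.diag(full_inv)[keep] >= np.diag(red_inv))
print(ok)  # True
```

With a random Gaussian design the columns are almost surely not exactly orthogonal, so the inequality is strict in practice.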