Is the expectation of the error of any projection of $\mathbb{E}[Y\mid X]$ onto a subspace zero?


Consider the following linear predictor of $Y$ based on $\mathbf{X}$: $$ Y_{\mathbf{b}}=\boldsymbol{\Sigma}_{Y, \mathbf{X}} \boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\left(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}}\right)+\mu_Y, $$ where $\mu_Y=\mathbb{E}[Y]$, $\boldsymbol{\mu}_{\mathbf{X}}=\mathbb{E}[\mathbf{X}]$, $\boldsymbol{\Sigma}_{\mathbf{X}}$ is the covariance matrix of $\mathbf{X}$, and $\boldsymbol{\Sigma}_{Y,\mathbf{X}}$ is the covariance between $Y$ and $\mathbf{X}$.

  1. It is well known that this is the best linear predictor of $Y$. In other words, it minimizes the population mean squared predictor error among all the linear functions of $X$.
  2. We also know that it is the $L_2$ projection of $\mathbb{E}[Y\mid X]$ onto the subspace of linear functions of $X$.
  3. Moreover, this is an unbiased linear predictor: $$\mathbb{E}[Y_{\mathbf{b}}] = \mathbb{E}[\boldsymbol{\Sigma}_{Y, \mathbf{X}} \boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\left(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}}\right)+\mu_Y] = \boldsymbol{\Sigma}_{Y, \mathbf{X}} \boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathbb{E}[\left(\mathbf{X}-\boldsymbol{\mu}_{\mathbf{X}}\right)]+\mu_Y = \mu_Y. $$ Therefore the error $Y - Y_{\mathbf{b}}$ has mean zero.
  4. Also, if one constructs an estimator of $Y_{\mathbf{b}}$ from $n$ samples (e.g., via OLS with an intercept, as the $\mu_Y$ term provides here), the sum of the residuals $Y_i - \hat{Y}_{\mathbf{b},i}$, $i=1,\dots,n$, equals $0$, so the empirical mean of the residuals is $0$.
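Points 3 and 4 can be checked numerically. The sketch below is my own toy example (not from any reference): it uses a deliberately nonlinear conditional mean $\mathbb{E}[Y\mid X]=X^2$, estimates the moments from a large sample, plugs them into the formula for $Y_{\mathbf{b}}$, and confirms that the error $Y - Y_{\mathbf{b}}$ has (empirical) mean essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy model with a nonlinear conditional mean: E[Y | X] = X^2.
X = rng.normal(size=n)
Y = X**2 + rng.normal(size=n)

# Population quantities, approximated from the large sample.
mu_X, mu_Y = X.mean(), Y.mean()
Sigma_X = np.cov(X)             # scalar variance of X here
Sigma_YX = np.cov(Y, X)[0, 1]   # covariance between Y and X

# Best linear predictor: Y_b = Sigma_YX * Sigma_X^{-1} * (X - mu_X) + mu_Y.
Y_b = (Sigma_YX / Sigma_X) * (X - mu_X) + mu_Y

# The error Y - Y_b has empirical mean ~ 0, matching point 3.
print(np.mean(Y - Y_b))
```

Because the centering by $\mu_X$ and the intercept $\mu_Y$ are built into the formula, the mean of the error is zero by construction, up to floating-point noise, even though the true regression function is nonlinear.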

I'm interested in particular in the fact that this $L_2$ projection of $\mathbb{E}[Y\mid X]$ onto the subspace of linear functions of $X$ is an unbiased predictor of $Y$.

Now suppose instead that we minimize the population mean squared prediction error over a function class $\mathcal{G}$ that is not necessarily linear in $X$, and let $f^*(\cdot)\in \mathcal{G}$ denote the minimizer.

My question is: is it true that $\mathbb{E}[f^*(X)] = \mathbb{E}[Y]$? Equivalently, I am trying to understand the expectation of the error, $\mathbb{E}[Y - f^*(X)]$. Moreover, let $\hat{f}(\cdot)$ be an estimator of $f^*(\cdot)$ obtained through empirical risk minimization. Do the corresponding residuals $Y_i - \hat{f}(X_i)$ have empirical mean zero?
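To make the question concrete, here is a minimal sketch (again my own toy example) of empirical risk minimization over a hypothetical nonlinear class $\mathcal{G} = \{f(x) = b\,|x| : b \in \mathbb{R}\}$, chosen because it contains no constant functions, so the argument used for the linear case does not obviously apply. It fits $\hat{f}$ by least squares and prints the empirical mean of the residuals; whether such a quantity must be (close to) zero is exactly what I am asking.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Same toy model: E[Y | X] = X^2.
X = rng.normal(size=n)
Y = X**2 + rng.normal(size=n)

# Hypothetical class G = { f(x) = b * |x| : b in R }, which contains
# no constant functions.  ERM over G (least squares in b) has the
# closed form b_hat = sum(|X_i| Y_i) / sum(|X_i|^2).
phi = np.abs(X)
b_hat = (phi @ Y) / (phi @ phi)
f_hat = b_hat * phi

# Empirical mean of the residuals Y_i - f_hat(X_i).
print(np.mean(Y - f_hat))
```

The printed value is what the question is about: for the linear class with an intercept it would be exactly zero, while for a general class $\mathcal{G}$ it is unclear to me whether anything forces it to vanish.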

I tried to answer this by studying the proofs of the properties of the best linear predictor, but those proofs rely heavily on the linear structure (in particular, the orthogonality of the error to every linear function of $X$, including constants). I would therefore appreciate any advice on this.