Regression of Y on X: best function minimizing the difference


In C & B, the following proof is requested:

Let X, Y be random variables. Then,

$$\min_{g(x)} E(Y - g(X))^2 = E(Y - E[Y|X])^2$$

where $g$ ranges over all (measurable) functions of $X$.

I proved the above by adding and subtracting $E[Y|X]$ inside the square on the left-hand side, expanding, and showing the cross term is zero by writing it out as integrals.

However, I do not fully understand the argument here:

  1. What is the expectation I am taking here? Is it over $f(x, y)$ - the joint distribution of $X, Y$, or over $f(y)$, the marginal of $Y$?

I reasoned that if it were $f(y)$, then the minimizing function would be the constant $E[Y]$, which does not fit the claimed result.
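That reasoning about constants is easy to check numerically. Here is a minimal sketch (the exponential distribution for $Y$ is an arbitrary choice, not from the text) confirming that among constant predictors $c$, the mean squared error $E(Y - c)^2$ is smallest at $c = E[Y]$:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=200_000)  # arbitrary distribution for Y

def mse_const(c):
    """Monte Carlo estimate of E[(Y - c)^2] for a constant predictor c."""
    return np.mean((y - c) ** 2)

c_star = y.mean()  # sample estimate of E[Y]

# E[Y] should beat any other constant predictor
assert mse_const(c_star) < mse_const(c_star + 0.5)
assert mse_const(c_star) < mse_const(c_star - 0.5)
```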

However, if the expectation is taken over the joint distribution $f(x, y)$, then

  2. why is it called "the best predictor of Y conditional on X"?

We are not conditioning on anything: the expectation $E(Y - g(X))^2$ is a real number, not a function of $X$.

Also, the solution manual (which is wrong about half the time) suggests using the law of iterated expectations here. However, I could only show the cross term is zero using raw integrals.

  3. How can the above be shown with the law of iterated expectations?


On BEST ANSWER

#1: It's with respect to the joint distribution of $(X,Y)$.


#3: Write

$$E[(Y-E[Y \mid X] + E[Y \mid X] - g(X))^2] = E[(Y-E[Y \mid X])^2] + E[(E[Y \mid X]-g(X))^2],$$

where the cross term vanishes because

\begin{align}
&E\left[ (Y-E[Y\mid X]) (E[Y \mid X] - g(X)) \right] \\
&= E\big[ E\big[ (Y-E[Y\mid X]) (E[Y \mid X] - g(X)) \mid X \big] \big] && \text{law of iterated expectations} \\
&= E\big[ (E[Y \mid X] - g(X)) \underbrace{E\big[ Y-E[Y\mid X] \mid X \big]}_{=0} \big],
\end{align}

where the second step pulls $E[Y \mid X] - g(X)$ outside the inner expectation $E[\cdot \mid X]$, which is allowed since it is a function of $X$. The inner expectation is zero, so the whole term vanishes.
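The vanishing cross term can be sketched numerically. Here is a toy model of my own choosing (not from the text): $Y = X^2 + \varepsilon$ with $\varepsilon$ independent of $X$, so that $E[Y \mid X] = X^2$, and an arbitrary competitor $g(X) = X$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.standard_normal(n)
eps = rng.standard_normal(n)   # noise, independent of x
y = x**2 + eps                 # model chosen so that E[Y | X] = X**2

cond_mean = x**2               # E[Y | X]
g = x                          # arbitrary competitor g(X)

# Monte Carlo estimate of the cross term E[(Y - E[Y|X]) (E[Y|X] - g(X))]
cross = np.mean((y - cond_mean) * (cond_mean - g))

assert abs(cross) < 0.05       # should be ~ 0 up to sampling noise
```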


#2: It's another way of saying "best predictor of $Y$ that is a function of $X$." You are right that the notion of "best" involves randomness in both $X$ and $Y$, but the predictor itself is "conditioned on $X$" because it is a function of $X$.
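Putting the pieces together, under the same toy model $Y = X^2 + \varepsilon$ (an illustrative assumption, with competitors chosen arbitrarily), the conditional mean attains a smaller mean squared error than any other function of $X$ we try, including the best constant $E[Y]$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.standard_normal(n)
y = x**2 + rng.standard_normal(n)       # E[Y | X] = X**2

def mse(pred):
    """Monte Carlo estimate of E[(Y - g(X))^2] for a given predictor."""
    return np.mean((y - pred) ** 2)

mse_cond   = mse(x**2)                  # E[Y|X]; should be ~ Var(eps) = 1
mse_linear = mse(x)                     # some other function of X
mse_const  = mse(np.full(n, y.mean()))  # best constant predictor, E[Y]

assert mse_cond < mse_linear
assert mse_cond < mse_const
```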