Asked to show that the function $g(X)$ minimising $E[(Y-g(X))^{2}]$ is $E[Y|X]$, we heuristically add and subtract $E[Y|X]$ inside the square, then expand to get two squared terms and a cross term. (This is Casella and Berger, Statistical Inference, exercise 4.13.)
$E[(Y - E[Y|X] + E[Y|X] - g(X))^{2}] = E[(Y - E[Y|X])^{2} + (E[Y|X] - g(X))^{2} + 2(Y-E[Y|X])(E[Y|X]-g(X))]$.
One crucial step of the argument (the one I'm unsure of) is asserting that
$E[2(Y-E[Y|X])(E[Y|X]-g(X))] = 0$.
We have that $E[Y|X]$ can be thought of as a function of $X$. Let $Z = 2(Y-E[Y|X])(E[Y|X]-g(X))$; then $Z$ depends on both $X$ and $Y$.
Now $E[E[Z|X]] = E[Z]$, and in this case we choose to go the other way, namely moving from $E[Z]$ to $E[E[Z|X]]$, where
$E[Z|X] = E[2(Y-E[Y|X])(E[Y|X]-g(X)) \mid X] = 2(E[Y|X]-g(X))\,E[Y-E[Y|X] \mid X] = 2(E[Y|X]-g(X)) \cdot 0 = 0$. The factor $2(E[Y|X]-g(X))$ passes through the conditional expectation unchanged, since it is a deterministic function of $X$; and $E[Y-E[Y|X] \mid X] = E[Y|X] - E[Y|X] = 0$.
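As a quick numerical sanity check (not part of the argument), one can simulate a toy model where $E[Y|X]$ is known in closed form, say $Y = X + \varepsilon$ with $\varepsilon$ independent mean-zero noise so that $E[Y|X] = X$, pick an arbitrary competitor $g$, and verify that the sample mean of $Z$ is near zero. The model and the choice of $g$ below are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)       # noise is mean-zero and independent of X, so E[Y|X] = X
g = lambda x: x**2 - 3*x         # an arbitrary competitor g

Z = 2 * (Y - X) * (X - g(X))     # the cross term, with E[Y|X] replaced by X
print(Z.mean())                  # ≈ 0, up to Monte Carlo error
```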
- Have I understood this correctly?
- We have an expression that is to be minimised as $g$ ranges over all functions. How could one prove this analytically?
My background is one year of undergraduate education thus far, but I'm very happy to dig deeper if it is required for understanding. Thanks in advance.
As pointed out in the comments, there are some requirements on $g$, so it cannot be an arbitrary function. Let $(\Omega,\mathcal{A},\mu)$ be a probability space and let $Y \in L^2(\mathcal{A})$. Recall that $L^2(\mathcal{A})$ is a Hilbert space. Consider a random variable $X:\Omega \to \mathbb{R}$ and let $\sigma(X)\subset \mathcal{A}$ be the $\sigma$-algebra generated by $X$. We look for $u \in L^2(\sigma(X))\subset L^2(\mathcal{A})$ such that $$E[|Y-u|^2]=\int_\Omega|Y(\omega)-u(\omega)|^2\,\mu(d\omega)=\|Y-u\|_2^2$$ is minimized. It can be proved that such a $u$ exists and is (a.e.) unique: it is the orthogonal projection of $Y$ onto the closed subspace $L^2(\sigma(X))$, and we call $E[Y|\sigma(X)]:=u$ the conditional expectation of $Y$ given $X$. Therefore $E[Y|\sigma(X)]:\Omega \to \mathbb{R}$ is itself a random variable, and it is $\sigma(X)$-measurable.

For $g:\mathbb{R} \to \mathbb{R}$ Borel measurable we have $$\begin{aligned}\{\omega\in \Omega:g(X(\omega))\leq a\}&=(g\circ X)^{-1}((-\infty,a])=\\&=X^{-1}(g^{-1}((-\infty,a]))=\\ &=\{\omega\in \Omega:X(\omega)\in g^{-1}((-\infty,a])\}\in \sigma(X)\end{aligned}$$ since $g^{-1}((-\infty,a]) \in \mathcal{B}(\mathbb{R})$. So if we additionally have $$\int_\Omega |g(X(\omega))|^2\,\mu(d\omega)=\int_\mathbb{R}|g(x)|^2\,\mu_X(dx)<\infty,$$ then $g(X) \in L^2(\sigma(X))$, the Hilbert space of $\sigma(X)$-measurable, square-integrable functions. Consequently $g(X)=E[Y|X]$ (a.e.) whenever $g$ solves the minimization problem in the OP.
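To make the projection picture concrete, here is a small numerical sketch (my own illustration, assuming a discrete $X$): when $X$ takes finitely many values, $E[Y|X]$ is just the group-wise mean of $Y$ on each level of $X$, and the residual $Y - E[Y|X]$ is orthogonal to the indicator of each event $\{X=k\}$, hence to every function of $X$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
X = rng.integers(0, 3, size=n)               # discrete X taking values 0, 1, 2
Y = X.astype(float) + rng.normal(size=n)

# conditional expectation = group-wise mean of Y on each level of X
cond_mean = np.array([Y[X == k].mean() for k in range(3)])
u = cond_mean[X]                             # E[Y|X] as a random variable on Omega

# the residual Y - u is orthogonal (in L^2) to each indicator of {X = k},
# and every function of X is a linear combination of these indicators
for k in range(3):
    assert abs(np.mean((Y - u) * (X == k))) < 1e-8
```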
Now suppose only that $g(X) \in L^2(\sigma(X))$. You understood correctly that to show $$E[(Y-E[Y|\sigma(X)])(E[Y|\sigma(X)]-g(X))]=0$$ you can use the tower property: indeed, $V:=E[Y|\sigma(X)]-g(X)$ is $\sigma(X)$-measurable, so $$\begin{aligned}E[(Y-E[Y|\sigma(X)])V]&=E[E[(Y-E[Y|\sigma(X)])|\sigma(X)]\,V]=\\ &=E[0\cdot V]=0. \end{aligned}$$
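Finally, a numerical sketch of the minimization claim itself (my own check, not a proof), again in a toy model $Y = X + \varepsilon$ with independent mean-zero noise, so that $E[Y|X]=X$: the conditional mean should beat any competitor $g$ in sample mean-squared error, and its MSE should be about the noise variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
X = rng.normal(size=n)
Y = X + rng.normal(size=n)              # unit-variance noise, so E[Y|X] = X

def mse(g):
    """Sample estimate of E[(Y - g(X))^2]."""
    return np.mean((Y - g(X))**2)

base = mse(lambda x: x)                 # MSE of E[Y|X] itself
for g in (lambda x: 0.9 * x,            # a few arbitrary competitors
          lambda x: x + 0.1,
          lambda x: np.sin(x)):
    assert mse(g) > base                # every competitor does worse
print(base)                             # ≈ 1.0, the noise variance
```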