Minimize $E[(Y-g(X))^{2}]$ as $g$ ranges over all functions


Asked to show that the function $g(X)$ minimising $E[(Y-g(X))^{2}]$ is $E[Y|X]$, we heuristically add and subtract $E[Y|X]$ inside the square, then expand to get two squared terms and a cross term. (This is Casella and Berger, Statistical Inference, Exercise 4.13.)

$E[(Y - E[Y|X] + E[Y|X] - g(X))^{2}] = E[(Y - E[Y|X])^{2} + (E[Y|X] - g(X))^{2} + 2(Y-E[Y|X])(E[Y|X]-g(X))]$.

One crucial step of the argument (the one I'm unsure of) is asserting that

$E[2(Y-E[Y|X])(E[Y|X]-g(X))] = 0$.

We have that $E[Y|X]$ can be thought of as a function of $X$. Let $Z = 2(Y-E[Y|X])(E[Y|X]-g(X))$, then $Z$ depends both on $X$ and $Y$.

Now $E[E[Z|X]] = E[Z]$, and in this case we choose to go the other way, namely moving from $E[Z]$ to $E[E[Z|X]]$, where

$E[Z|X] = E[2(Y-E[Y|X])(E[Y|X]-g(X)) \mid X] = 2(E[Y|X]-g(X))\,E[Y-E[Y|X] \mid X] = 2(E[Y|X]-g(X)) \cdot 0 = 0$. The factor $2(E[Y|X]-g(X))$ can be pulled out of the conditional expectation because it is a deterministic function of $X$.
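As a numerical sanity check of this step (a sketch, assuming a toy model where $E[Y|X]$ has a closed form: $Y = X^2 + \epsilon$ with $X$ and $\epsilon$ independent standard normals, so $E[Y|X] = X^2$; the competitor $g(x)=\sin x$ is an arbitrary choice), the cross term should average to approximately zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.standard_normal(n)
Y = X**2 + rng.standard_normal(n)  # E[Y|X] = X**2 by construction

cond_exp = X**2          # the true conditional expectation E[Y|X]
g = np.sin               # an arbitrary alternative predictor g(X)

# Monte Carlo estimate of the cross term 2 E[(Y - E[Y|X])(E[Y|X] - g(X))]
cross = 2 * (Y - cond_exp) * (cond_exp - g(X))
print(cross.mean())      # close to 0 up to Monte Carlo error
```

Any other square-integrable $g$ would do in place of $\sin$; the cross term vanishes for all of them.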

  • Have I understood this correctly?
  • We have an expression that is to be minimised as $g$ ranges over all functions. How could one prove this analytically?

My background is one year of undergraduate education thus far, but I'm very happy to dig deeper if it is required for understanding. Thanks in advance.


There are 2 solutions below.

BEST ANSWER

As pointed out in the comments, there are some requirements on $g$, so it cannot range over literally all functions. Let $(\Omega,\mathcal{A},\mu)$ be a probability space and let $Y \in L^2(\mathcal{A})$. Recall that $L^2(\mathcal{A})$ is a Hilbert space. Consider the random variable $X:\Omega \to \mathbb{R}$ and $\sigma(X)\subset \mathcal{A}$, the $\sigma$-algebra generated by $X$. We look for $u \in L^2(\sigma(X))\subset L^2(\mathcal{A})$ such that $$E[|Y-u|^2]=\int_\Omega|Y(\omega)-u(\omega)|^2\mu(d\omega)=\|Y-u\|_2^2$$ is minimized. It can be proved that such a $u$ exists and is (a.e.) unique: it is the orthogonal projection of $Y$ onto $L^2(\sigma(X))$, and we call $E[Y|\sigma(X)]:=u$ the conditional expectation of $Y$ given $X$. Therefore $E[Y|\sigma(X)]:\Omega \to \mathbb{R}$ is itself a random variable which is $\sigma(X)$-measurable.

For $g:\mathbb{R} \to \mathbb{R}$ Borel measurable we have $$\begin{aligned}\{\omega\in \Omega:g(X(\omega))\leq a\}&=(g\circ X)^{-1}((-\infty,a])=\\&=X^{-1}(g^{-1}((-\infty,a]))=\\ &=\{\omega\in \Omega:X(\omega)\in g^{-1}((-\infty,a])\}\in \sigma(X)\end{aligned}$$ since $g^{-1}((-\infty,a]) \in \mathcal{B}(\mathbb{R})$. So if we additionally have $$\int_\Omega |g(X(\omega))|^2\mu(d\omega)=\int_\mathbb{R}|g(x)|^2\mu_X(dx)<\infty,$$ then $g(X) \in L^2(\sigma(X))$, that is, it belongs to the Hilbert space of $\sigma(X)$-measurable, square-integrable functions. So $g(X)=E[Y|X]$ (a.e.) if $g$ solves the minimization problem in the OP.


Now suppose only that $g(X) \in L^2(\sigma(X))$. You understood correctly that to show $$E[(Y-E[Y|\sigma(X)])(E[Y|\sigma(X)]-g(X))]=0$$ you can use the tower property: indeed, $V:=E[Y|\sigma(X)]-g(X)$ is $\sigma(X)$-measurable, so $$\begin{aligned}E[(Y-E[Y|\sigma(X)])V]&=E[E[(Y-E[Y|\sigma(X)])|\sigma(X)]\,V]=\\ &=E[0\cdot V]=0. \end{aligned}$$


Your derivation seems correct, but there is a faster way to see that the term is $0$. You can always write:

$$ Y = E[Y|X] + \epsilon $$

This is a tautology, as all we are doing is giving the name $\epsilon$ to $Y-E[Y|X]$; but it turns out that decomposing $Y$ in this way is useful because of the properties of $\epsilon$:

$\bullet$ $\epsilon$ is mean independent of $X$.

Proof: $E[\epsilon|X]=E[Y-E[Y|X]|X] = E[Y|X] - E[Y|X]=0$

$\bullet$ $\epsilon$ is orthogonal to any (square-integrable) function $f(X)$.

Proof: $E[\epsilon f(X)] = E[E[\epsilon f(X)|X]] = E[E[\epsilon|X]f(X)] = 0$
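Both properties can be illustrated by simulation (a sketch, again assuming the toy model $Y = X^2 + \epsilon$ with independent standard normal $X$ and $\epsilon$, so that $\epsilon = Y - E[Y|X]$ exactly; the functions $f$ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.standard_normal(n)
eps = rng.standard_normal(n)   # plays the role of Y - E[Y|X]
Y = X**2 + eps                 # E[Y|X] = X**2, so eps = Y - E[Y|X]

# Mean independence E[eps|X] = 0 forces E[eps * f(X)] = 0
# for any square-integrable f; check a few choices of f.
for f in (np.sin, np.cos, lambda x: x**3):
    print(np.mean(eps * f(X)))  # each close to 0
```

Each printed value is $0$ up to Monte Carlo error, as the orthogonality property predicts.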

Then, you can think of $E[Y|X] - g(X)$ as $f(X)$. Regarding your second bullet point, once you eliminate the last term you have:

$$E[(Y - g(X))^{2}] = E[(Y - E[Y|X])^{2}] + E[(E[Y|X] - g(X))^{2}]$$

where the first term does not depend on $g$, so it does not matter in the minimization problem, and the second term is non-negative and is minimized at $0$ by setting $g(X) = E[Y|X]$. You may find this approach interesting if you dive into machine learning applications, where the algorithms you use essentially try to learn or approximate $E[Y|X]$.
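The decomposition can be seen in action numerically (a sketch under the same toy-model assumption $Y = X^2 + \epsilon$, so the true conditional expectation is $x \mapsto x^2$ and the irreducible first term $E[(Y-E[Y|X])^2]$ equals the noise variance, $1$): every other competitor pays the irreducible error plus the extra term $E[(E[Y|X]-g(X))^2]$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
X = rng.standard_normal(n)
Y = X**2 + rng.standard_normal(n)   # E[Y|X] = X**2; irreducible error = 1

# A few competitors g; the true conditional expectation is x -> x**2
candidates = {
    "x**2 (the true E[Y|X])": lambda x: x**2,
    "x**2 + 0.5": lambda x: x**2 + 0.5,
    "x": lambda x: x,
    "0": lambda x: np.zeros_like(x),
}
mses = {name: np.mean((Y - g(X))**2) for name, g in candidates.items()}
for name, mse in mses.items():
    print(f"g(x) = {name}: MSE = {mse:.3f}")
```

The conditional expectation attains the smallest mean squared error of the four, close to the irreducible value $1$.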