The problem I'm fighting with right now originally comes from signal noise detection: given a probability space $(\Omega, \mathcal{A}, \mu)$ and random variables $X, Y : \Omega \to \mathbb{R}$ (where $X$ captures a 'true' signal and $Y$ is its noisy variant $Y = r(X) + \text{error}$ that was observed in reality), try to estimate $X$ from $Y$ as well as possible.
'Good' here means that given an estimation function $g : \mathbb{R} \to \mathbb{R}$, the mean square error $$ E = E[(X - g(Y))^2]$$ is as small as possible. Now generally speaking, in this pdf (slide 4) they prove that the optimal estimator is $$ \hat{X} := E(X|Y) $$
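(Side remark for orientation, my own and not from the pdf: in the jointly Gaussian case this optimal estimator has a well-known closed form. If $(X,Y)$ is jointly Gaussian with $\operatorname{Var}(Y) > 0$, then $$E(X|Y) = E(X) + \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(Y)}\,\big(Y - E(Y)\big),$$ i.e. the abstract recipe reduces to the familiar linear least squares estimator.)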
where $E(X|Y)$ is a conditional expected value, i.e. the almost surely unique function $E(X|Y) = \alpha : \Omega \to \mathbb{R}$ such that for all sets $B \in \mathcal{B}(\mathbb{R})$ (the Borel $\sigma$-algebra) $$\int_{Y^{-1}(B)} \alpha(\omega) \, d\mu(\omega) = \int_{Y^{-1}(B)} X(\omega) \, d\mu(\omega)$$
i.e. on all sets of the form $Y^{-1}(B)$, the 'restricted' expected values of $E(X|Y)$ and $X$ coincide.
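(A toy example to make the definition concrete, my own and not from the pdf: take $\Omega = \{1,\dots,6\}$ with the uniform measure, $X(\omega) = \omega$ and $Y(\omega) = \omega \bmod 2$. The sets $Y^{-1}(B)$ are exactly $\emptyset$, $\{1,3,5\}$, $\{2,4,6\}$ and $\Omega$, and matching the integrals over these sets forces $$E(X|Y)(\omega) = \begin{cases} 3, & \omega \text{ odd}, \\ 4, & \omega \text{ even}, \end{cases}$$ i.e. $E(X|Y)$ averages $X$ over the level sets of $Y$.)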
One can show that for sufficiently regular functions $a,b,c$, we have $$E(ab|c) = b \cdot E(a|c) \qquad (*)$$ and furthermore, by putting $B=\mathbb{R}$, we get $E(E(a|c)) = E(a)$. By some standard yoga one also shows that $E(\lambda a + b|c) = \lambda E(a|c) + E(b|c)$, that $a \geq 0$ implies $E(a|c) \geq 0$, and that $E(E(a|c)|c) = E(a|c)$.
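(To spell out the $B=\mathbb{R}$ step: since $c^{-1}(\mathbb{R}) = \Omega$, the defining equation (with $c$ in place of $Y$) reads $$\int_\Omega E(a|c)\,d\mu = \int_\Omega a\,d\mu,$$ which is exactly $E(E(a|c)) = E(a)$.)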
Now the proof in the pdf goes as follows: for every estimator $g$ (possibly different from $\hat{X}$),
$$E[(X-g(Y))^2] = E(E[(X-g(Y))^2 | Y])$$ then they analyze the inner term: $$\begin{align*}E[(X-g(Y))^2|Y] &= E[(X-\hat{X} + \hat{X} - g(Y))^2 | Y] \\ &= E[(X-\hat{X})^2 + 2(X-\hat{X})(\hat{X}-g(Y)) + (\hat{X} - g(Y))^2 | Y] \\ &= E[(X-\hat{X})^2|Y] + 2\,E[(X-\hat{X})(\hat{X}-g(Y))|Y] + E[(\hat{X} - g(Y))^2 | Y] \\ &\geq E[(X-\hat{X})^2|Y] + 2\,E[(X-\hat{X})(\hat{X}-g(Y))|Y] \end{align*}$$ Then they want to show that the second term is zero (I guess), and that is because of $(*)$: $$E[(X-\hat{X})(\hat{X} - g(Y))|Y] = (\hat{X}-g(Y))\, E[X - \hat{X}|Y] = (\hat{X}-g(Y)) \cdot \big(E(X|Y) - E(E(X|Y)|Y)\big)$$ and since $E(E(X|Y)|Y) = E(X|Y)$, this expression is zero.
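(To finish the argument as the pdf presumably does: once the cross term is zero, the chain above gives $E[(X-g(Y))^2|Y] \geq E[(X-\hat{X})^2|Y]$ almost surely, and taking expectations on both sides, using $E(E(\cdot|Y)) = E(\cdot)$, yields $$E[(X-g(Y))^2] \geq E[(X-\hat{X})^2],$$ so $\hat{X}$ minimizes the mean square error among all estimators $g(Y)$.)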
Questions:
(1): Does that make sense (the last step where one uses $(*)$)? I don't think that this is the precise reason because ...
(2): If so, isn't it true that [by the same argument]
$$E((X - \hat{X})^2|Y) = (X - \hat{X})\, E[(X - \hat{X})|Y] = (X - \hat{X}) \cdot 0 = 0$$ so that the estimator $g(Y) = E(X|Y)$ is not only the best one possible but even has squared error $0$ (i.e. it is kind of a perfect estimator??)?
Thanks in Advance,
FW
I was lacking precision. Usually one defines the conditional expectation with respect to a sub-$\sigma$-algebra: let $\mathcal{C}$ be a sub-$\sigma$-algebra of $\mathcal{A}$. Then every function $g:\Omega \to \mathbb{R}$ that satisfies:
(1) $g$ is $\mathcal{C}$-$\mathcal{B}(\mathbb{R})$-measurable
(2) $\int_C X \, d\mu = \int_C g \, d\mu \quad \forall C \in \mathcal{C}$
is called a conditional expectation of $X$ given $\mathcal{C}$. One immediately shows that it is almost everywhere unique; hence one uses a single symbol $E(X|\mathcal{C})$.

Now the precise statement of $(*)$ is: for every $\mathcal{C}$-$\mathcal{B}(\mathbb{R})$-measurable function $r : \Omega \to \mathbb{R}$, we have $rX \in L^1(\Omega) \iff rE(X|\mathcal{C}) \in L^1(\Omega)$, and in this case $$\Big( \int_C E(rX|\mathcal{C}) \, d\mu = \Big) \int_C rX \, d\mu = \int_C r\,E(X|\mathcal{C}) \, d\mu$$ (which is clear by definition for characteristic functions $r = 1_{C'}$ with $C' \in \mathcal{C}$, then for linear combinations of them, then for positive $X$ and $r$ by monotone convergence, and finally for the general case by splitting $X$ and $r$ into their positive and negative parts). In words: the defining equation of the conditional expectation holds not only for $X$ but for $X$ multiplied against an arbitrary $\mathcal{C}$-measurable 'test' function $r$.

Now one shows as above that for any $\mathcal{C}$-$\mathcal{B}(\mathbb{R})$-measurable function $g$, $$E\big( (X-E(X|\mathcal{C}))^2 \big) \leq E\big( (X-g)^2 \big),$$ i.e. $E(X|\mathcal{C})$ is nothing else than the $L^2$-projection of $X$ onto the subspace of $\mathcal{C}$-$\mathcal{B}(\mathbb{R})$-measurable functions.

In the key step one can pull the factor $g - E(X|\mathcal{C})$ out of $E\big( [g-E(X|\mathcal{C})][X - E(X|\mathcal{C})] \big)$, but not the other factor: $g - E(X|\mathcal{C})$ is $\mathcal{C}$-$\mathcal{B}(\mathbb{R})$-measurable, whereas $X - E(X|\mathcal{C})$ in general is not. That is the error that explains why (2) in my question does not work.
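(Writing out the key step with the corrected statement, where $\hat{X} := E(X|\mathcal{C})$ and $g$ is $\mathcal{C}$-measurable with the needed integrability: applying $(*)$ with test function $r = g - \hat{X}$ and $C = \Omega$ gives $E([g-\hat{X}]X) = E([g-\hat{X}]\hat{X})$, hence $$E\big([g-\hat{X}][X-\hat{X}]\big) = E\big([g-\hat{X}]X\big) - E\big([g-\hat{X}]\hat{X}\big) = 0,$$ and therefore $$E\big((X-g)^2\big) = E\big((X-\hat{X})^2\big) + E\big((\hat{X}-g)^2\big) \geq E\big((X-\hat{X})^2\big).$$ The same trick with $r = X - \hat{X}$ is not allowed, since that factor is generally not $\mathcal{C}$-measurable, which is exactly the gap in (2).)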