In Bayesian inference why is $E[\hat\Theta|X] = \hat\Theta$ ?!


https://youtu.be/XtNXQJkgkhI?t=1147

In this MIT video on Bayesian statistical inference, starting around the 19:07 mark, the professor claims that

$E[\hat\Theta|X] = \hat\Theta$

because $\hat\Theta$ is a function of $X$.

I feel this is a trivial argument but somehow I don't quite get it.

Could someone elaborate on that a bit?

Both $\hat\Theta$ and $X$ are uppercase here i.e.
the word is about random variables and not about concrete values of random variables.

(For reference: in this course, $E[X|Y]$ is defined to be a random variable, not a number.)

There are 2 best solutions below

BEST ANSWER

I think the best way to understand this segment of the lecture is to extend the particular example that was discussed, to explicitly calculate the estimator $\hat \Theta$. Rather than using increasing levels of abstraction or formalization, we seek to illuminate by making the example more concrete.

Recall in the previous portion of the lecture that the model is $$\Theta \sim \operatorname{Uniform}(4,10), \\ X \mid \Theta \sim \operatorname{Uniform}(\Theta - 1, \Theta + 1).$$ The joint density has value $1/12$ over its support, which is a parallelogram: $$f_{\Theta, X}(\theta, x) = \frac{1}{12} \mathbb 1(4 \le \theta \le 10) \mathbb 1(\theta - 1 \le x \le \theta + 1).$$ This parallelogram is bounded by the lines $$\theta = x-1, \quad \theta = x+1, \quad \theta = 4, \quad \theta = 10.$$

When $5 \le X \le 9$, the conditional expectation $\operatorname{E}[\Theta \mid X]$ is just the midpoint between $X+1$ and $X-1$; i.e., $\operatorname{E}[\Theta \mid X] = X$. In other words, on this interval the conditional expectation is the line parallel to and midway between the aforementioned boundaries $\theta = x-1$ and $\theta = x+1$; i.e., $\theta = x$. However, when $3 \le X < 5$, we have to take the midpoint between $\theta = x+1$ and $\theta = 4$; i.e., $$\operatorname{E}[\Theta \mid X] = \frac{X+1+4}{2} = \frac{X+5}{2}.$$ And when $9 < X \le 11$, we similarly have $$\operatorname{E}[\Theta \mid X] = \frac{X-1+10}{2} = \frac{X+9}{2}.$$

All together, $$\hat \Theta = \operatorname{E}[\Theta \mid X] = \begin{cases}\frac{X+5}{2}, & 3 \le X < 5 \\ X, & 5 \le X \le 9 \\ \frac{X+9}{2}, & 9 < X \le 11. \end{cases}$$

You will note that this is a continuous but not everywhere differentiable function. More importantly, you will also note that $\operatorname{E}[\Theta \mid X]$ is a random variable that is solely a function of $X$, and it seeks to estimate $\Theta$ through the observed $X$. Hence $\hat \Theta = \operatorname{E}[\Theta \mid X]$ is what he calls the least mean squares estimator.
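If it helps, the piecewise formula can be checked by simulation: draw $(\Theta, X)$ pairs from the model and compare the empirical mean of $\Theta$ near a fixed value of $X$ against the formula. This is a rough sketch in Python/NumPy; the sample size, seed, and bin half-width $0.02$ are arbitrary choices of mine, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000

# Simulate the model: Theta ~ Uniform(4, 10), X | Theta ~ Uniform(Theta-1, Theta+1)
theta = rng.uniform(4, 10, n)
x = rng.uniform(theta - 1, theta + 1)

def theta_hat(x):
    """Piecewise LMS estimator E[Theta | X] derived above."""
    return np.where(x < 5, (x + 5) / 2,
                    np.where(x <= 9, x, (x + 9) / 2))

# Approximate E[Theta | X = x0] by averaging Theta over a narrow bin of X values,
# and compare with the closed-form piecewise estimator.
for x0 in (4.0, 7.0, 10.0):
    mask = np.abs(x - x0) < 0.02
    print(x0, theta[mask].mean(), float(theta_hat(x0)))
```

With enough samples the two columns agree to a couple of decimal places, including in the edge regions where the midpoint formula kicks in.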

The essential claim that you have questioned is $\operatorname{E}[\hat \Theta \mid X] = \hat \Theta$. We can now see from the above example what the professor means: once $X$ is given, $\hat \Theta$ is no longer random, because it is a deterministic function of $X$. Its conditional expectation given $X$ is therefore just $\hat \Theta$ itself, unchanged. For instance, if I ask for $\hat \Theta \mid (X = 8)$, you would give me $8$. Taking the conditional expectation of $\hat \Theta$ given $X$ doesn't modify the estimate.

Another way to think of it is to suppose I let $h(X) = X^2$. Then what is $\operatorname{E}[h(X) \mid X = x]$? It is just $\operatorname{E}[X^2 \mid X = x] = \operatorname{E}[x^2] = x^2$, since given $X = x$, the quantity $x^2$ is a constant. Similarly, $\operatorname{E}[X^2 \mid X] = X^2$, and more generally $\operatorname{E}[h(X) \mid X] = h(X)$. Applying this with the function $h$ for which $\hat \Theta = h(X)$ gives $\operatorname{E}[\hat \Theta \mid X] = \hat \Theta$.
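The identity $\operatorname{E}[h(X) \mid X] = h(X)$ can also be seen numerically: approximate "conditioning on $X = x_0$" by restricting to the samples where $X$ rounds to $x_0$, and note that the average of $h(X)$ over that slice is essentially $h(x_0)$. A small sketch (the normal distribution, seed, and bin width are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
h = x ** 2  # h(X) = X^2

# Crude "given X = x0": keep samples whose value of X rounds to x0.
# Within such a narrow slice, h(X) is nearly constant, so its
# conditional average is just h evaluated at x0.
bins = np.round(x, 1)
for x0 in (-1.0, 0.5, 2.0):
    mask = bins == x0
    print(x0, h[mask].mean(), x0 ** 2)
```

Each slice average matches $x_0^2$ up to the discretization error of the bin, illustrating that conditioning on $X$ leaves any function of $X$ unchanged.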

ANSWER

Let $X$ be a real-valued random variable on a probability space $(\Omega,\mathcal{F},\mathbb{P})$, and assume that $X \in L^2(\Omega)$ (i.e., $X$ has a well-defined variance/second moment).

Definition: We say that a random variable $Y$ is $X$-measurable if there exists a deterministic, Borel-measurable function $f : \operatorname{Range}(X) \to \mathbb{R}$ such that $$ Y = f(X) $$ almost surely. This means that $Y$ is completely determined by $X$, up to a deterministic transformation. In this sense it is the opposite of saying that $X$ and $Y$ are independent.

Definition: For an $L^2$ random variable $Z$, we define the random variable $$ \mathbb{E}[Z|X] \equiv f_Z(X) $$ to be the unique (almost surely) random variable such that:

  1. $\mathbb{E}[Z|X]$ is $X$-measurable, so $\mathbb{E}[Z|X]$ is given by a deterministic transform of $X$ (this transformation depends on $Z$, of course, as you'll see in the next condition).
  2. $\mathbb{E}[Z|X]$ minimizes the mean square error $$ \mathbb{E}[ (Z - g(X))^2] $$ over all Borel-measurable deterministic functions $g$ (equivalently, over all $X$-measurable random variables $g(X)$).

Thus, the conditional expectation can be interpreted as the unique deterministic transformation of $X$ that minimizes the mean squared error amongst all such deterministic transformations (Borel-measurable of course).
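This minimizing property can be illustrated on the lecture's example above: by simulation, the estimator $\hat\Theta = \operatorname{E}[\Theta \mid X]$ achieves a smaller mean squared error than other deterministic functions of $X$. This sketch compares it against two competitors I picked arbitrarily, $g(X) = X$ and the constant prior mean $g(X) = 7$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
theta = rng.uniform(4, 10, n)
x = rng.uniform(theta - 1, theta + 1)

# The LMS estimator E[Theta | X] from the example: a deterministic function of X.
lms = np.where(x < 5, (x + 5) / 2, np.where(x <= 9, x, (x + 9) / 2))

def mse(est):
    """Empirical mean squared error of an estimator of Theta."""
    return np.mean((theta - est) ** 2)

# The conditional expectation beats both competitor functions g(X).
print(mse(lms), mse(x), mse(np.full(n, 7.0)))
```

The constant $7$ ignores the data entirely (its MSE is roughly $\operatorname{Var}(\Theta) = 3$), $g(X) = X$ is good in the middle but wasteful near the edges of the parallelogram, and the LMS estimator does strictly better than both.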

The general definition of measurability is more abstract and harder to motivate, but my biggest breakthrough in probability was realizing that if $X$ and $Y$ are random variables and $Y$ is $X$-measurable, then $Y$ and $X$ are related by a deterministic transform. This doesn't go both ways: for example, if $Y = X^2$, then knowing $X$ gives you the value of $Y$, but knowing $Y$ does not uniquely specify $X$.