Help me with the Conditional Expectation

The Problem.

A person will arrive at the airport between 2 PM and 3 PM. His arrival time, measured in hours after 2 PM, is a random variable $X$ whose pdf is $f(x) = 5x^4$ for $0 \le x \le 1$. If he arrives before 14:30, he can take the plane. Given that he took it, what is the expectation of his arrival time?

$$ \mathbb{E}\left( X | X \leq \frac{1}{2} \right) = \frac{ \int_0^{1/2} x \cdot 5x^4 \, dx }{P\left( X \leq \frac{1}{2} \right)} = \frac{\left(\frac{5}{384}\right)}{\frac{1}{32}} = \frac{160}{384} = \frac{5}{12} $$

This is the answer: $5/12 = 25/60$, so 14:25 is the expected arrival time. However, this is my first time seeing the conditional expectation of a continuous variable, so I tried to understand it by changing it into a discrete problem.

$E(X|Y=y) = \sum_{x} xP(X=x|Y=y) = \sum_{x} x\frac{P(X=x,Y=y)}{P(Y=y)} $

This is the conditional expectation for discrete random variables, so I copied it and let $X$ be a discrete random variable taking values from 1 to 60 (imagining the minutes of the arrival time):

$ E[X \mid X \le 30] = \sum_{x=1}^{60} \frac{x \times P(X=x, X \le 30)}{P(X \le 30)}$

But isn't $P(X=x, X \le 30)$ the same as $P(X \le 30)$, leaving $E[X \mid X \le 30] = \sum_{x=1}^{60} x$? Of course, the value of this sum is $\frac{60 \times 61}{2} = 1830$, which is very weird. What am I missing?

Best answer

For the time being, let us ignore the error introduced by discretizing $X$ and look more closely at the computation of the conditional expectation of the discrete variable.

First, $X$ in the original question and solution counts the time in hours after 2:00 PM, so $0 \le X \le 1$. When you discretized $X$, you considered a random variable, which we will call $Y$, that counts the number of whole minutes after 2:00 PM, so $$Y \in \{1, 2, \ldots, 60\}.$$ Then you attempted to calculate

$$\operatorname{E}[Y \mid Y \le 30] = \frac{\sum_{y=1}^{60} y \Pr[(Y = y) \cap (Y \le 30)]}{\Pr[Y \le 30]}. \tag{1}$$ The first and most important mistake you made is writing $$\Pr[(Y = y) \cap (Y \le 30)] = \Pr[Y \le 30].$$ This is incorrect because the summation in Equation $(1)$ takes place over the entire sequence of minutes from $1$ to $60$ inclusive. To really make this point concrete, we can write out a number of terms in this sum:

$$\begin{align} \sum_{y=1}^{60} y \Pr[(Y = y) \cap (Y \le 30)] &= 1 \Pr[(Y = 1) \cap (Y \le 30)] + 2 \Pr[(Y = 2) \cap (Y \le 30)] \\ &\quad + 3 \Pr[(Y = 3) \cap (Y \le 30)] + \cdots \\ &\quad + 31 \Pr[(Y = 31) \cap (Y \le 30)] \\ &\quad + 32 \Pr[(Y = 32) \cap (Y \le 30)] + \cdots \\ &\quad + 60 \Pr[(Y = 60) \cap (Y \le 30)]. \end{align}$$

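Now do you see the error? For $y \in \{31, 32, \ldots, 60\}$, the joint probability $\Pr[(Y = y) \cap (Y \le 30)]$ is zero, because the two events cannot occur together. For $y \in \{1, 2, \ldots, 30\}$, the event $Y = y$ already implies $Y \le 30$, so the joint probability is $\Pr[Y = y]$, not $\Pr[Y \le 30]$. So the correct simplification is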

$$\operatorname{E}[Y \mid Y \le 30] = \frac{\sum_{y=1}^{\color{red}{30}} y \Pr[Y = y]}{\Pr[Y \le 30]}.\tag{2}$$

The second error you made is that you assumed that the arrival time in the discrete case is uniformly distributed; i.e., $\Pr[Y = y] = \frac{1}{60}$, for each $y \in \{1, 2, \ldots, 60\}$. This does not match the arrival time distribution in the continuous case, because the probability density of the arrival time $X$ is $$f_X(x) = 5x^4, \quad 0 \le x \le 1. \tag{3}$$ This tells us that it is actually more likely that the person arrives closer to 3:00 PM than 2:00 PM, since the probability density increases monotonically as $x$ goes from $0$ to $1$.

To model this behavior in a discrete way, we would need to find a probability mass function on $\{1, 2, \ldots, 60\}$ that "fits" this density. The easiest way to do this is to let $$\Pr[Y = y] = \Pr\left[\frac{y-1}{60} < X \le \frac{y}{60}\right]. \tag{4}$$ In other words, the probability that the person arrives $15$ minutes after 2 PM is equal to the probability that he arrives at any time between $14/60$ and $15/60$ hours after 2 PM. We then compute

$$\Pr[Y = y] = \int_{x=(y-1)/60}^{y/60} 5x^4 \, dx = \left(\frac{y}{60}\right)^5 - \left(\frac{y-1}{60}\right)^5. \tag{5}$$

Now we may use this to proceed with the evaluation of Equation $(2)$:

$$\begin{align} \Pr[Y \le 30] &= \sum_{y=1}^{30} \Pr[Y = y] \\ &= \sum_{y=1}^{30} \left( \left(\frac{y}{60}\right)^5 - \left(\frac{y-1}{60}\right)^5 \right) \\ &= \left(\frac{30}{60}\right)^5 - \left(\frac{0}{60}\right)^5 \tag{telescoping sum} \\ &= \frac{1}{32}. \end{align}$$
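As a quick sanity check of Equation $(5)$ and the telescoping sum, here is a minimal Python sketch using exact rational arithmetic. The function name `pmf` and the variable names are illustrative choices, not part of the original problem.

```python
from fractions import Fraction

def pmf(y, n=60):
    """Pr[Y = y] from Equation (5): the integral of 5x^4 over ((y-1)/n, y/n]."""
    return Fraction(y, n) ** 5 - Fraction(y - 1, n) ** 5

# The masses over all 60 minutes should sum to exactly 1.
total = sum(pmf(y) for y in range(1, 61))

# The masses over the first 30 minutes should telescope to (30/60)^5 = 1/32.
p_le_30 = sum(pmf(y) for y in range(1, 31))

print(total, p_le_30)  # prints: 1 1/32
```

Using `Fraction` avoids floating-point rounding, so both sums can be compared with the closed-form values exactly.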

We also have

$$\begin{align} \sum_{y=1}^{30} y \Pr[Y = y] &= \sum_{y=1}^{30} y \left( \left(\frac{y}{60}\right)^5 - \left(\frac{y-1}{60}\right)^5 \right) \\ &= \frac{1}{60^5} \sum_{y=1}^{30} \left( 5y^5 - 10y^4 + 10y^3 - 5y^2 + y \right) \\ &= \frac{619312575}{60^5} \\ &\approx 0.7964410686728\ldots. \tag{6} \end{align}$$

Therefore, dividing by $\Pr[Y \le 30] = \frac{1}{32}$, $$\operatorname{E}[Y \mid Y \le 30] = 32 \cdot \frac{619312575}{60^5} \approx 25.48611419753,$$ but again, this is measured in minutes, not hours. By comparison, the conditional expectation of $X$, converted to minutes, is $60 \cdot 5/12 = 25$. Where does this difference between $Y$ and $X$ of about $0.486$ minutes come from? It of course arises from the discretization of $X$ to $Y$; but more specifically, it comes from the particular choice of how $X$ was discretized, which we made in Equation $(4)$. This choice pushes all of the probability in the interval $X \in \left(\frac{y-1}{60}, \frac{y}{60}\right]$ to $Y = y$. As we implied earlier, if the person arrives at 2:14:30 PM, then $Y = 15$, which is $30$ seconds after the actual arrival time. This is why the conditional expectation of $Y$ is larger than that of $X$, and why the excess is nearly $0.5$ minutes: if arrivals were spread evenly within each minute, $Y$ would exceed $X$ by $0.5$ minutes on average. The excess is slightly less than $0.5$ minutes because, as we stated before, the probability density of $X$ is not uniform; it is weighted toward later times of arrival, so within each minute the arrival tends to fall closer to the end of the interval.
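The arithmetic behind Equations $(2)$ and $(6)$ can be reproduced with a short Python sketch. It assumes the discretization of Equation $(4)$; the names `pmf`, `numerator`, `denominator`, and `cond_mean` are illustrative only.

```python
from fractions import Fraction

def pmf(y, n=60):
    """Pr[Y = y] under the discretization of Equation (4)."""
    return Fraction(y, n) ** 5 - Fraction(y - 1, n) ** 5

# Numerator of Equation (2): only y = 1..30 contribute, because
# Pr[(Y = y) and (Y <= 30)] is zero for y > 30.
numerator = sum(y * pmf(y) for y in range(1, 31))

# Denominator of Equation (2): Pr[Y <= 30].
denominator = sum(pmf(y) for y in range(1, 31))

# Conditional expectation of Y given Y <= 30, in minutes after 2 PM.
cond_mean = numerator / denominator

print(numerator == Fraction(619312575, 60 ** 5))  # True
print(float(cond_mean))                           # about 25.486
```

The exact value of `cond_mean` exceeds the continuous answer of $25$ minutes by a little under half a minute, which matches the explanation above.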

If you repeat this exercise with even finer time increments, say seconds, so $Y \in \{1, 2, \ldots, 3600\}$, you will find that the conditional expectation, when compared to $X$, will be nearly identical (when converted to the same units of time). But as you can see, this whole discretization idea is tedious and entirely unnecessary. The computation of the conditional expectation of the discretized time of arrival is actually harder than the continuous case.
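To see the effect of finer time increments, the following sketch repeats the calculation for an arbitrary number of bins per hour and converts the result back to minutes. The function `conditional_mean_minutes` and its parameter `bins_per_hour` are hypothetical names introduced for illustration, and the sketch assumes the same "round up to the end of the bin" discretization as Equation $(4)$ with an even number of bins.

```python
from fractions import Fraction

def conditional_mean_minutes(bins_per_hour):
    """E[Y | Y <= n/2] for Y discretized into n bins, in minutes after 2 PM."""
    n = bins_per_hour
    half = n // 2  # bins covering the first half hour (n assumed even)

    def pmf(y):
        # Probability that X falls in ((y-1)/n, y/n] when the pdf is 5x^4.
        return Fraction(y, n) ** 5 - Fraction(y - 1, n) ** 5

    numerator = sum(y * pmf(y) for y in range(1, half + 1))
    denominator = sum(pmf(y) for y in range(1, half + 1))
    # numerator / denominator is in bin units; one bin lasts 60/n minutes.
    return float(numerator / denominator * Fraction(60, n))

for n in (60, 600, 3600):
    print(n, conditional_mean_minutes(n))
```

With one-second bins (`bins_per_hour=3600`), the result is within a hundredth of a minute of the continuous value of $25$ minutes, while remaining slightly above it.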