Inequality involving expectations of two random variables


I am trying to understand a proof about minimax rates for a class of estimators. However, one step remains unclear:

Assume we observe $N$ independent samples from a distribution $\mathcal{P}$: $(X_i, W_i, Y_i)_{i=1}^N \sim \mathcal{P}^N $, where $X_i$ denotes the random vector of covariates for unit $i$, $W_i$ specifies whether unit $i$ was treated or not ($1$ or $0$), and $Y_i$ is the outcome variable. Then $Y_i$ can be written as:

$Y_i = W_i Y_i(1) + (1-W_i)Y_i(0)$, where $Y_i(0) \in \mathbb{R}$ denotes the potential outcome of unit $i$ that could be observed if $i$ were assigned to the control group, whereas $Y_i(1) \in \mathbb{R}$ is the potential outcome of $i$ under treatment.
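The observed-outcome construction above can be sketched numerically; everything below (the distributions, the value of $\beta$, the treatment assignment) is an illustrative assumption, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3

# Illustrative simulation of the potential-outcomes setup.
X = rng.normal(size=(N, d))          # covariates X_i
W = rng.integers(0, 2, size=N)       # treatment indicator W_i in {0, 1}
beta = np.array([1.0, -2.0, 0.5])    # hypothetical linear effect
Y0 = rng.normal(size=N)              # potential outcome Y_i(0) under control
Y1 = Y0 + X @ beta                   # potential outcome Y_i(1) under treatment

# Observed outcome: Y_i = W_i * Y_i(1) + (1 - W_i) * Y_i(0)
Y = W * Y1 + (1 - W) * Y0
```

Only one of the two potential outcomes is ever observed per unit, which is exactly what the identity encodes.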

Now define the "treatment function" $\tau(x) := \mathbb{E}[ Y(1) - Y(0) \mid X = x ]$ and assume $\tau$ to be linear, i.e. $\tau(x)=x^T\beta$ for some $\beta \in \mathbb{R}^d$.

I am now interested in the following expectation: $\mathbb{E} \Big[ (\tau(\mathcal{X}) - \hat{\tau}(\mathcal{X}))^2\Big] = \mathbb{E} \Big[ (\mathcal{X}^T(\beta - \hat{\beta}))^2 \Big]$, where the expectation is taken over both the training data set $(X_i, W_i, Y_i)_{i=1}^N \sim \mathcal{P}^{N}$ and $\mathcal{X}$, which is distributed according to the marginal distribution of $X$ under $\mathcal{P}$.

The authors make the following claim: $\mathbb{E} \Big[ (\tau(\mathcal{X}) - \hat{\tau}(\mathcal{X}))^2 \Big] = \mathbb{E} \Big[ (\mathcal{X}^T(\beta - \hat{\beta}))^2 \Big] \leq \mathbb{E} \big[ \Vert \mathcal{X} \Vert^2 \big] \mathbb{E} \big[\Vert \beta - \hat{\beta}\Vert^2 \big]$.
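As a quick numerical sanity check of the claimed bound, here is a small Monte Carlo sketch in which $\mathcal{X}$ and a stand-in for $\beta - \hat{\beta}$ are drawn independently (the independence is an assumption of this sketch only, and all distributional choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M, d = 100_000, 5

# Independent draws for the test point X and for a stand-in
# for the error vector v = beta - beta_hat.
X = rng.normal(size=(M, d))
V = rng.normal(size=(M, d))

# Monte Carlo estimates of both sides of the claimed inequality.
lhs = np.mean(np.sum(X * V, axis=1) ** 2)                        # E[(X^T v)^2]
rhs = np.mean(np.sum(X**2, axis=1)) * np.mean(np.sum(V**2, axis=1))
assert lhs <= rhs
```

With standard normal draws the left side comes out near $d$ and the right side near $d^2$, so the bound is far from tight here.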

My attempt to understand how they reach this inequality:

$\mathbb{E} \Big[ (\tau(\mathcal{X}) - \hat{\tau}(\mathcal{X}))^2 \Big] = \mathbb{E} \Big[ (\mathcal{X}^T(\beta - \hat{\beta}))^2 \Big] \overset{\text{Cauchy–Schwarz}}{\leq} \mathbb{E} \Big[ \Vert \mathcal{X} \Vert^2 \, \Vert \beta - \hat{\beta}\Vert^2 \Big] \overset{\perp}{=} \mathbb{E} \big[ \Vert \mathcal{X} \Vert^2 \big] \mathbb{E} \big[\Vert \beta - \hat{\beta}\Vert^2 \big]$ ,
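The first step is the ordinary Cauchy–Schwarz inequality on $\mathbb{R}^d$, $(x^T v)^2 \leq \Vert x \Vert^2 \Vert v \Vert^2$, applied pointwise to each realization before taking expectations. A minimal numerical check (with arbitrary illustrative vectors, not the quantities from the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x = rng.normal(size=d)   # a draw of the test covariate vector
v = rng.normal(size=d)   # stands in for beta - beta_hat

# Pointwise Cauchy-Schwarz: (x^T v)^2 <= ||x||^2 * ||v||^2
lhs = (x @ v) ** 2
rhs = np.dot(x, x) * np.dot(v, v)
assert lhs <= rhs
```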

where $\perp$ denotes independence of the random variables. However, in my opinion there is no reason for $\mathcal{X}$ and $(\beta - \hat{\beta})$ (which depends on the training data set) to be independent. Which other inequality could be used to reach the claim made by the authors?

Thanks a lot!