Relation between sampling size and estimation error


Let $x\sim\mathcal{N}(\mu_x, \sigma_x^2)$ and $y\sim\mathcal{N}(\mu_y, \sigma_y^2)$ be two real random variables, normally distributed as above.

Let also $Q$ be the following non-negative quantity: $$ Q = \left\lvert\frac{1}{n}\sum_{i=1}^{n}y_i - \frac{1}{n}\sum_{i=1}^{n}x_i\right\rvert, $$ where $x_i$'s, $y_i$'s are drawn from the distributions $\mathcal{N}(\mu_x, \sigma_x^2)$ and $\mathcal{N}(\mu_y, \sigma_y^2)$, respectively.

As $n\to\infty$, $Q\to\lvert\mu_y-\mu_x\rvert$. Is there an error bound for this estimate? To put it better: can we have a rule for the sample size $n$, so that the above estimate of the distance $\lvert\mu_y-\mu_x\rvert$ is within some given accuracy $\epsilon$?

More interestingly, if $\mathbf{x}\sim\mathcal{N}(\mu_x, \Sigma_x)$ and $\mathbf{y}\sim\mathcal{N}(\mu_y, \Sigma_y)$ are multivariate ($d$-dimensional) normal vectors, with given means and covariance matrices, and $\mathbf{x}_i$, $\mathbf{y}_i$ are drawn from them, respectively, what would be the error in estimating the norm $\lVert\mu_y-\mu_x\rVert$ by the quantity $$ \left\lVert\frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_i - \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right\rVert\,? $$

Again, what I'm looking for is a relation between the sample size $n$ and the dimensionality $d$, so that the estimate achieves some given accuracy (say $\epsilon$).

It seems reasonable to me that if the dimensionality of the variables $\mathbf{x}$ and $\mathbf{y}$ gets high, then the sample size $n$ needed to achieve a "good" (given-error) estimate of the above quantity would get high, too. The question is: how fast does it grow? I suspect it wouldn't be just a polynomial relation, but maybe exponential; this is only a guess, though.

The above might be trivial for statisticians, but I'm not even sure where to look. Could you help me with some guidance?

Best answer:

Comments:

You could start by using Chebyshev's Inequality to get a bound on $|\bar X_n - \mu_x|$ that shows $\bar X_n \stackrel{prob}{\rightarrow} \mu_x.$ Similarly for the $Y_i,$ and so on.
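Concretely, Chebyshev's Inequality gives $P(|\bar X_n - \mu_x| \ge \epsilon) \le \sigma_x^2/(n\epsilon^2)$, so requiring the right side to be at most some failure probability $\delta$ yields the sample-size rule $n \ge \sigma_x^2/(\epsilon^2\delta)$. A minimal sketch (the function name and the numerical values are illustrative, not from the question):

```python
import math

def chebyshev_sample_size(sigma2: float, eps: float, delta: float) -> int:
    """Smallest n such that sigma2 / (n * eps^2) <= delta, i.e. the
    Chebyshev bound guarantees P(|Xbar_n - mu| >= eps) <= delta."""
    return math.ceil(sigma2 / (eps ** 2 * delta))

# Example: sigma^2 = 1, want |Xbar_n - mu| <= 0.1 with probability >= 0.95
n = chebyshev_sample_size(1.0, eps=0.1, delta=0.05)
print(n)  # 2000
```

As the answer notes below, this bound is distribution-free and therefore quite loose for normal data; it is a guarantee, not a sharp requirement.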

However, in practice, it might be more useful to find the normal distribution of $\bar Y_n - \bar X_n$, and observe that its variance decreases to $0$ with increasing $n.$ For specified numerical values of the parameters, you could get the exact distribution of $Q,$ based on this normal distribution. (The Chebyshev bounds are good enough to show convergence in probability, but generally too loose to be useful in practice. Chebyshev's Inequality works for all distributions with a finite variance, so the bounds tend not to be very tight for any one distribution.)
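To make the exact route concrete: $\bar Y_n - \bar X_n \sim \mathcal{N}\!\big(\mu_y-\mu_x,\ (\sigma_x^2+\sigma_y^2)/n\big)$, so $Q = |\bar Y_n - \bar X_n|$ is folded normal and its CDF is available in closed form via the normal CDF. A hedged sketch using only the standard library (parameter values are illustrative):

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def q_cdf(q, mu_x, mu_y, var_x, var_y, n):
    """P(Q <= q) for Q = |Ybar_n - Xbar_n|, which is folded normal:
    Ybar_n - Xbar_n ~ N(mu_y - mu_x, (var_x + var_y)/n)."""
    delta = mu_y - mu_x
    s = math.sqrt((var_x + var_y) / n)
    return phi((q - delta) / s) - phi((-q - delta) / s)

# Probability that Q lands within eps = 0.1 of |mu_y - mu_x| = 1,
# for unit variances and n = 200:
p = q_cdf(1.1, 0.0, 1.0, 1.0, 1.0, 200) - q_cdf(0.9, 0.0, 1.0, 1.0, 1.0, 200)
print(round(p, 3))
```

Inverting this kind of expression for $n$ gives a much sharper sample-size rule than Chebyshev, at the cost of using the normality assumption.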

If you look at the development of 'Hotelling's $T^2$ distribution' you might find approaches useful for the multivariate version of your question.
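On the multivariate guess: for independent coordinates with unit variances, $E\big\lVert(\bar{\mathbf{y}}_n-\bar{\mathbf{x}}_n)-(\mu_y-\mu_x)\big\rVert^2 = 2d/n$, which suggests $n$ needs to grow only linearly in $d$ for a fixed accuracy, not exponentially. A Monte Carlo sketch checking this scaling (all parameter values made up for illustration; pure standard library, so it is slow for large $d$, $n$):

```python
import math
import random

def rmse_of_mean_gap(d: int, n: int, trials: int = 200, seed: int = 0) -> float:
    """Root-mean-square of ||(Ybar_n - Xbar_n) - (mu_y - mu_x)|| over
    repeated trials, with mu_x = mu_y = 0 and unit-variance independent
    coordinates, so the theoretical value is sqrt(2*d/n)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xbar = [sum(rng.gauss(0, 1) for _ in range(n)) / n for _ in range(d)]
        ybar = [sum(rng.gauss(0, 1) for _ in range(n)) / n for _ in range(d)]
        total += sum((y - x) ** 2 for x, y in zip(xbar, ybar))
    return math.sqrt(total / trials)

# Simulated vs. theoretical error for d = 10, n = 50:
print(rmse_of_mean_gap(d=10, n=50), math.sqrt(2 * 10 / 50))
```

Quadrupling $n$ should roughly halve the error, and doubling $d$ at fixed $n$ should inflate it by about $\sqrt{2}$, consistent with a polynomial (in fact linear) dependence of $n$ on $d$.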