Calculating the median from a sample


Say I have a sequence $a_1\leq a_2\leq\dots\leq a_n$ of $n$ numbers. Say I pick a subsequence of $k$ samples from this sequence. Can I approximate the median of the original sequence from the sample?

At first, I thought the median of the sample would work. But consider the sequence $1,2,9$ and let $k=1$. Then the expected value of the median of a $k$-subsequence equals $(1+2+9)/3 = 4$, which is the mean of the sequence, not its median $2$.
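The counterexample can be checked by brute force. Below is a minimal R sketch that averages the sample median over every $k$-element subset, assuming each subset is equally likely (sampling without replacement). The helper name `mean_subsample_median` is made up for illustration.

```r
# Average of the sample median over all k-element subsets of a,
# i.e. the expected subsample median when every subset is equally likely.
# (Hypothetical helper; assumes length(a) > 1 so combn() treats a as a set.)
mean_subsample_median <- function(a, k) {
  mean(combn(a, k, FUN = median))
}

a <- c(1, 2, 9)
median(a)                     # median of the full sequence
mean_subsample_median(a, 1)   # with k = 1 this is just mean(a)
mean_subsample_median(a, 3)   # the only subset is a itself
```

With $k=1$ each "median" is a single element, so the average is the mean of the sequence; with $k=n$ the subsample is the whole sequence and the median is recovered exactly.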


There is 1 best solution below


I am not sure about your purpose in estimating the median of a sample by taking the median of a randomly selected subsample. Generally speaking, I'd say it would work better for larger subsamples.

Suppose I have a sample of size $n = 200$ from $\mathsf{Norm}(\mu=100,\sigma = 15).$

set.seed(918)
x = rnorm(200, 100, 15)
median(x)
[1] 99.93832

Its median is 99.938.

Now take 100 subsamples (without replacement) of size $n_s = 50,$ find their medians, and look at the distribution of the medians of the subsamples.

set.seed(1234)
h = replicate(100, median(sample(x,50)))
summary(h);  sd(h)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  94.96   99.09  100.51  100.51  102.15  105.92 
[1] 2.364744

The mean and median of the 100 subsample medians are both 100.51, which is not far from the median 99.94 of the whole sample.

Here is a histogram of the 100 subsample medians.

hist(h, prob=T, col="skyblue2", main="Medians of Subsamples")

If you used only one subsample of size 50, you could have gotten a result anywhere between 94.96 and 105.92, the extremes observed among these 100 subsample medians.


If I use very small subsamples of size ten, the results are even more scattered:

set.seed(1235)
h = replicate(100, median(sample(x,10)))
summary(h);  sd(h)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  86.92   96.27  100.21   99.92  103.00  112.18 
[1] 5.79906
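
To compare several subsample sizes in one run, the two experiments above can be wrapped in a loop. This is only a rough sketch: it regenerates the same simulated sample `x`, but the list of sizes and the second seed are illustrative choices that do not come from the runs above, so the printed numbers will not match them exactly.

```r
set.seed(918)
x <- rnorm(200, 100, 15)        # same simulated sample as above

sizes <- c(10, 50, 100, 200)    # illustrative subsample sizes (assumption)
set.seed(2024)                  # arbitrary seed, not one used above
spread <- sapply(sizes, function(k) {
  # SD of 100 subsample medians, each from k draws without replacement
  sd(replicate(100, median(sample(x, k))))
})
data.frame(size = sizes, sd_of_medians = spread)
```

At size 200 every subsample is the whole sample in a different order, so all 100 medians coincide with `median(x)` and the spread is zero; smaller sizes give progressively noisier estimates.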