Referring to the definition in Wikipedia, paraphrased:
Formally, a median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median. As seen above, medians may not be unique. If each set contains less than half the population, then some of the population is exactly equal to the unique median.
If the data set has an odd number of observations, the middle one is selected. For example, the following list of seven numbers,
If the data set has an even number of observations, there is no distinct middle value and the median is usually defined to be the arithmetic mean of the two middle values.
But, as all you statisticians know, this can be misleading. Consider the following dataset (R code):
foo <- rnorm(1000)
bar <- rnorm(1001,50)
median(c(foo,bar))
[1] 46.86411
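To see where that number comes from, here is a minimal sketch. The `set.seed` call and the names `x`, `n`, and `mid` are my additions for illustration, so the value will not match the run above. With 2001 observations in total, the median is the 1001st sorted value, and since essentially every `foo` value lies below every `bar` value, that should be simply the smallest value in `bar`:

```r
set.seed(42)                  # assumption: seed added only so the sketch is reproducible
foo <- rnorm(1000)            # low cluster, centred at 0
bar <- rnorm(1001, 50)        # high cluster, centred at 50
x   <- c(foo, bar)
n   <- length(x)              # 2001 observations, an odd count
mid <- sort(x)[(n + 1) / 2]   # the 1001st order statistic

median(x) == mid              # the median is that order statistic
mid == min(bar)               # ... which is the lowest point of the high cluster
```

So the median sits at the extreme lower edge of the high cluster rather than anywhere "between" the two groups.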
But the graph shows how bad a choice that is:
hist(c(foo,bar), breaks=100)
Clearly we'd be better off picking the arithmetic mean of the maximum of the low peak and the minimum of the high peak. Is this just a case of "look at the data before deciding which statistical parameters matter"? Or is there a reason the strict definition of "median" fails to produce a useful value here?
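For concreteness, the alternative I have in mind could be sketched as follows. This is only one assumed way to do it when the group labels are unknown: sort the pooled data, find the widest gap between neighbouring values, and take the midpoint of that gap. The seed and the names `x`, `gaps`, `i`, and `split_point` are hypothetical, chosen for illustration:

```r
set.seed(42)                  # assumption: seed added only so the sketch is reproducible
foo <- rnorm(1000)
bar <- rnorm(1001, 50)
x   <- sort(c(foo, bar))

gaps <- diff(x)               # distances between consecutive sorted values
i    <- which.max(gaps)       # index of the value just below the widest gap
split_point <- (x[i] + x[i + 1]) / 2   # midpoint of that gap

split_point
median(x)                     # for comparison: the conventional median
```

For data this well separated, the widest gap is the one between the two clusters, so `split_point` is the same as `(max(foo) + min(bar)) / 2`.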
