This might be something really easy or something impossible, so sorry if my question does not make sense though I feel like it should be done somehow easily.
I have the following set of given data:
| Earnings (dollars) | % of population |
|---|---|
| Less than 10,000 | 6 |
| 10,000 to 14,999 | 3.9 |
| 15,000 to 24,999 | 7.5 |
| 25,000 to 34,999 | 7.8 |
| 35,000 to 49,999 | 11.3 |
| 50,000 to 74,999 | 16.8 |
| 75,000 to 99,999 | 12.8 |
| 100,000 to 149,999 | 16.3 |
| 150,000 to 199,999 | 7.9 |
| 200,000 or more | 9.8 |
| - | - |
| Median income | 69717 |
| Mean income | 97962 |
What I need to find out is the median value of a specific bracket, let's say for people earning 50k+. I know median means the middle of the range so, I can estimate the median by halving the sum of the percentages of people earning 50k+ as follows:
(16.8 + 12.8 + 16.3 + 7.9 + 9.8) / 2 = 31.8
31.8 correlates with "a bit more" of (16.8 + 12.8) = 29.6 , so I can roughly estimate the median earning of people that earn 50k+ is "a bit more" than 100k. I basically need to transform this manual "correlation" into a formula.
I will have the same type of table for hundreds of times with different data, in MS Excel, and I need a formula that can estimate the median of a specific bracket with as little error as possible.
I hope I was clear enough and thank you in advance for any input!!
If you take the cumulative sum of the percentages of your table, you get a sequence of values of the CDF at some number, say $N$, of discrete points. If you have enough points, you can attempt to approximate the CDF by fitting it to an appropriate family of functions defined by no more than $N$ parameters (being careful about the requirement that a CDF is necessarily non-decreasing and that it attains the values 0 and 1 at the endpoints of the domain on which it is defined). After you have done that, the kinds of median calculations you want to do can be done either analytically (for some families) or at least numerically.