Find the "average" discrete distribution for some summary statistics?

The new law requires companies to make summary statistics of salaries publicly available:

  1. Mean
  2. Standard deviation
  3. First quartile
  4. Median
  5. Third quartile

For $n$ people working at a company, the true wages form a list of $n$ elements that has exactly these summary statistics. Moreover, the number of possible lists is obviously finite!

Let's take this finite set of lists and sort the elements of each one. It is then possible to calculate the average of the $k$-th smallest element ($1 \le k \le n$) across all lists. I think the list of these averages would be a very reasonable reconstruction of the possible wages (I called it an "average" discrete distribution in the title).

How should I approach this problem? Could you suggest some references? Also, are there perhaps other, easier ways to reconstruct sensible values?


Edit: after more than a year, I'm still thinking about this problem.

There are 2 best solutions below

1
On

I see two disadvantages with the proposed approach. First, I don't know how to calculate a representative distribution without first enumerating every distribution that fits the summary statistics, and although that set is finite under an integer assumption, it is likely to be prohibitively large. Second, even if every individual distribution satisfies the summary statistics, the "average" of these distributions as you describe it need not satisfy all of them. Averaging sorted lists element by element is linear, so it preserves the mean, but the standard deviation of the averaged list is generally smaller than the target.
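A toy brute-force check makes both points concrete. The sketch below assumes $n = 9$ integer wages between 0 and 12, takes the quartiles to be the sorted elements at indices 2, 4 and 6, and uses a made-up target list; all of these choices are illustrative assumptions, not part of the question. It enumerates every sorted list with the same five statistics and averages them position by position.

```python
from itertools import combinations_with_replacement
from math import sqrt

def summary(xs):
    """Return (sum, sum of squares, Q1, median, Q3) of a sorted list of 9 integers.

    Assumed quartile convention: for n = 9 the quartiles are the sorted
    elements at indices 2, 4 and 6. The sum and the sum of squares stand in
    for the mean and standard deviation so that comparisons stay exact.
    """
    return (sum(xs), sum(x * x for x in xs), xs[2], xs[4], xs[6])

def pop_std(xs):
    """Population standard deviation (divides by n)."""
    m = sum(xs) / len(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

target = (1, 2, 3, 4, 5, 6, 7, 8, 9)  # hypothetical "true" wages
stats = summary(target)

# combinations_with_replacement yields non-decreasing tuples, i.e. sorted lists.
matches = [xs for xs in combinations_with_replacement(range(13), 9)
           if summary(xs) == stats]

# Average the k-th smallest element across all matching lists.
average = [sum(col) / len(matches) for col in zip(*matches)]

print(len(matches), "lists match the summary statistics")
print("average list:", average)
print("mean:", sum(average) / 9, "target:", sum(target) / 9)
print("std: ", pop_std(average), "target:", pop_std(target))
```

Even this tiny case scans roughly 300,000 candidate lists, and the count grows combinatorially with $n$ and the wage range. Because averaging is linear, the averaged list keeps the mean and, under this quartile convention, the quartiles, but its standard deviation comes out strictly smaller than the target whenever more than one list matches.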

However you go about constructing a distribution, the fact is that only five numbers are given, so any full distribution has to fill in the huge information gap with some assumptions. The classical statistical way to do this is to fit a parametric distribution to the data on hand, letting the shape of the distribution fill in the gaps. I think that approach would work well here, though it is maybe not as exciting and nonparametric as your idea. Given that this is an income distribution, I'd suggest looking at power law distributions or some other right-skewed, heavy-tailed distribution.
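As a minimal sketch of the parametric route, assume a lognormal shape (one right-skewed candidate among several, chosen here only for convenience) and made-up figures. The median and mean fix the two parameters, and the remaining statistics can then be compared with the published ones to judge the fit. The function names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def fit_lognormal(mean, median):
    """Match a lognormal to a mean and a median (requires mean > median > 0)."""
    mu = log(median)                      # the median of a lognormal is exp(mu)
    sigma = sqrt(2 * log(mean / median))  # the mean is exp(mu + sigma**2 / 2)
    return mu, sigma

def lognormal_quantile(p, mu, sigma):
    """Value below which a fraction p of the fitted distribution lies."""
    return exp(mu + sigma * NormalDist().inv_cdf(p))

def lognormal_std(mu, sigma):
    """Standard deviation implied by the fitted parameters."""
    return sqrt(exp(sigma ** 2) - 1) * exp(mu + sigma ** 2 / 2)

# Made-up summary statistics, for illustration only.
mu, sigma = fit_lognormal(mean=60_000, median=50_000)

# Statistics that were not used in the fit; compare them with the published ones.
print("implied Q1: ", lognormal_quantile(0.25, mu, sigma))
print("implied Q3: ", lognormal_quantile(0.75, mu, sigma))
print("implied std:", lognormal_std(mu, sigma))

# One possible reconstruction of n wages: evenly spaced quantiles of the fit.
n = 10
wages = [lognormal_quantile((k + 0.5) / n, mu, sigma) for k in range(n)]
print([round(w) for w in wages])
```

If the implied quartiles and standard deviation are far from the published ones, that is a sign the assumed family is a poor fit and another shape should be tried.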

Depending on your application, there is another way of looking at the problem, called distributionally robust optimization (DRO). DRO builds on other optimization theory, such as linear and robust optimization, and identifies decisions that work well no matter which of the distributions consistent with the summary statistics is the correct one.

1
On

Suppose the given summary statistics are $m,s,q_1,q_2,q_3$.

One approach is to create a distribution in which

  • 26% of the values are $q_1$
  • 26% of the values are $q_2$
  • 26% of the values are $q_3$
  • 11% of the values are $x$
  • 11% of the values are $y$

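Any such distribution has the right median and quartiles whatever the values of $x$ and $y$: each of $q_1$, $q_2$, $q_3$ carries 26% of the mass, while $x$ and $y$ together carry only 22%, so the 25%, 50% and 75% points always land on $q_1$, $q_2$ and $q_3$ respectively. We can therefore solve for $x$ and $y$ to get the right mean and standard deviation.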
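
One way to carry out that solve, assuming $s$ is the population standard deviation (dividing by $n$), is to match the first two moments. Writing $Q = q_1 + q_2 + q_3$ and $R = q_1^2 + q_2^2 + q_3^2$ (symbols introduced here only for convenience):

$$0.11\,(x + y) + 0.26\,Q = m, \qquad 0.11\,(x^2 + y^2) + 0.26\,R = s^2 + m^2.$$

With $A = x + y = (m - 0.26\,Q)/0.11$ and $B = x^2 + y^2 = (s^2 + m^2 - 0.26\,R)/0.11$, the identity $(y - x)^2 = 2B - A^2$ gives

$$x = \frac{A - \sqrt{2B - A^2}}{2}, \qquad y = \frac{A + \sqrt{2B - A^2}}{2}.$$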

This can lead to negative or complex solutions for $x$ and $y$ which are not reasonable salaries, but in many cases this simple procedure will produce a reasonable distribution of salaries with the desired properties.

Example: The American Medical Informatics Association reported exactly these statistics from a recent salary survey: "The overall mean (standard deviation) salary of the biomedical informatics respondents in this study was \$181,774 (\$99,566) and the median (interquartile range) was \$165,000 (\$111,000-\$230,000)."

Solving for $x$ and $y$ shows that this is consistent with a distribution where

  • 11% of the values are \$44,125
  • 26% of the values are \$111,000
  • 26% of the values are \$165,000
  • 26% of the values are \$230,000
  • 11% of the values are \$412,366
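
These figures can be reproduced with a short sketch. It assumes the reported standard deviation is the population one (dividing by $n$) and matches the mean and second moment of the five-point distribution; the function name `solve_tails` is made up for illustration.

```python
from math import sqrt

def solve_tails(m, s, q1, q2, q3, w_q=0.26, w_tail=0.11):
    """Return (x, y) so that the five-point distribution has mean m and std s.

    Assumes s is the population standard deviation. Returns None when no
    real solution exists.
    """
    a = (m - w_q * (q1 + q2 + q3)) / w_tail                        # x + y
    b = (s**2 + m**2 - w_q * (q1**2 + q2**2 + q3**2)) / w_tail     # x^2 + y^2
    disc = 2 * b - a**2                                            # (y - x)^2
    if disc < 0:
        return None
    root = sqrt(disc)
    return (a - root) / 2, (a + root) / 2

x, y = solve_tails(181774, 99566, 111000, 165000, 230000)
print("x, y:", round(x), round(y))

# Check the mean and standard deviation of the resulting distribution.
values = [x, 111000, 165000, 230000, y]
weights = [0.11, 0.26, 0.26, 0.26, 0.11]
mean = sum(w * v for w, v in zip(weights, values))
std = sqrt(sum(w * (v - mean) ** 2 for w, v in zip(weights, values)))
print("mean, std:", round(mean), round(std))
```

The `None` branch corresponds to the complex case; a negative $x$ is returned as is, so the caller still has to decide whether the result is a plausible salary.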