Best way to answer average question with large range of data

67 Views Asked by At

I have expense data for 30+ departments. I want to figure the best way to answer the question 'what is the average expense?' The problem is that each department has a different range of data and size. If I take the average of everything the answer will be skewed.

Sample data:

dept A frequency: 8
dept B frequency: 19
dept C frequency: 28

dept A: 1, 2, 3, 4, 1, 2, 4, 5
dept B: 30, 40, 20, 25, 30, 40, 40, 40, 50, 30, 35, 60, 30, 50, 55, 45, 43, 33, 32
dept C: 1000, 2000, 1500, 3000, 3240, 4000, 2300, 1000, 2000, 1500, 3000, 3240, 4000, 2300, 1000, 2000, 1500, 3000, 3240, 4000, 2300, 1000, 2000, 1500, 3000, 3240, 4000, 2300

dept A average: 2.75
dept B average: 38.32
dept C average: 2434.29

Some ideas I have are to take the average of averages, give an average for different bins of expenses (this will remove the department level), normalize the expenses, or standardize the expenses.

I am open to any suggestions on how to communicate a good answer to the question.

Thanks!

2

There are 2 best solutions below

0
On BEST ANSWER

First, you might consider why you are making this report and who will read it. As always, you should have a clear objective in mind when you summarize data. I will make a few suggestions consistent with various purposes.

If the purpose is to grasp how much money is spent overall on such expenses, and what a 'fair share' of that should be for each expense claim, then the weighted average of the answer by @SWilliams is appropriate.

Here the weight for a department is based on its number of claims ('frequeency'). For the $i$th department out of $N$, denote the frequency by $f_i$ and the average by $a_i$, Then the weighted average is

$$ A = \frac{\sum_{i=1}^N f_i a_i}{\sum_{i=1}^N f_i} = \frac{f_1 a_i + f_2 a_2 + \cdots + f_N a_N}{f_1 + f_2 + \cdots + f_N}$$

In your example, this is $$A = \frac{ 22.00 + 728.08 + 68,160.12}{8+ 19+ 28} = \frac{ 96,910.20}{55} = 1252.91.$$

Except for rounding errors in expressing the three averages to the nearest cent, this is the same as the (ordinary) mean of the 55 numbers:

     1    2    3    4    1    2    4    5   30   40   20   25   30   40   40
    40   50   30   35   60   30   50   55   45   43   33   32 1000 2000 1500
  3000 3240 4000 2300 1000 2000 1500 3000 3240 4000 2300 1000 2000 1500 3000
  3240 4000 2300 1000 2000 1500 3000 3240 4000 2300

However, if you are more interested in showing 'typical' claims for the individual departments and also for the company as a whole, you might give the medians for departments and for the whole company. This would require you to know individual claim amounts within each department. (There is no way to obtain the company median just from the department medians.)

For your example, medians are \$2.5, \$40, and \$2300 for the three departments, and \$1000 for the entire company. For your data, the medians are not much different from the corresponding means \$2.75, \$38.32, \$2435.29, and \$1252.91. For real data there might be more variability within each department department, making a list of medians more informative.

If you want to call attention to notable outliers within some departments, you could report them separately, or use boxplots that show outliers.

Here are some boxplots for fake data of my own. The frequencies of claims in the three departments are 10, 20, 30; the different widths of the boxes suggests these differences. (Boxplots do not indicate exact sample size, so it might be a good idea to list claim frequencies for departments in a caption.)

For each department, the box extends from the lower to the upper quartile (including the 'middle half' of the data). The crossbar within each box is as the department median. (The three medians are \$104, \$52, \$299 for departments, and \$128 for the company.)

The second department had a skewed distribution of claims with couple of 'outliers' at the high end. The mean for this department is \$88, considerably greater than the mean \$52 because of the outliers.

enter image description here

Finally, you need to decide whether it is important to draw attention to the variability of claims within each department. Boxplots certainly show variability, very low for my first department, rather higher for my third. More simply, you might want to give ranges in a table along with means or medians. (If your audience would understand their meaning, perhaps standard deviations.)

My fake data have department means \$103, \$88, \$288, and company mean \$191 (which you can verify from the formula above). The question you need to ponder is whether these four numbers reveal enough information. Or, perhaps, whether the boxplots reveal more than anyone really wants to know.

0
On

Given the data in your example, it makes sense to report department averages as well as the overall average (as if the data were all coming from one department) and then provide commentary around why the overall taken by itself hides important information. Showing the department averages puts the spotlight on Dept. C! This will answer the question but also protect against misinterpretation.

Note: If you report the average of averages, that result will give each department equal weight in the calculation and this may be undesirable. In particular, going that route masks the fact that Dept. C has more than 3 times the amount of data relative to Dept. A (in addition to bigger numbers). If each department was reporting the same number of data points, the overall average would equal the average of the department averages - which isn't the case here.

If you also have bins of expense, you could generate a table, such as the one below, for each bin. But then there are lots of other ways to cut the data at that level.

enter image description here