Hopefully this is an easy question to answer.
I have a series of five water samples; each sample has a different volume and contains a different amount of plastic particles. I can calculate the concentration of each sample by dividing the mass of particles by the volume of the sample (e.g. 500 mg of particles in 1 litre of water = 500 mg/L).
If I want to understand the true average of the dataset, would it be more appropriate to take the average of the concentrations of all the samples OR take the total mass of all the samples combined and divide that by the total volume to get an average concentration? See below for example:
| | Sample volume (L) | Sample mass (mg) | Sample concentration (mg/L) |
|---|---|---|---|
| | 4 | 253 | 63.3 |
| | 6 | 439 | 73.2 |
| | 5 | 205 | 41.0 |
| | 9 | 226 | 25.1 |
| | 4 | 30 | 7.5 |
| Total | 28 | 1153 | |
Would the correct average in this case be 42.0 mg/L (the average of the calculated concentrations) OR the calculated average that divides the total sum of the sample mass (i.e. 1153 mg) by the total sample volume (28 L), i.e. 41.2 mg/L?
I know the discrepancy between the two values is small in this case; however, if you expand the dataset, this discrepancy will increase.
I feel the answer to this question is a simple mathematical response, but I just can't seem to come to it myself.
I appreciate the help!
Generally speaking, the total mass divided by the total volume is the more accurate estimate of the mean particulate concentration, under the assumption that samples are independent and taken at random from a population that doesn't change from sample to sample.
To understand why, consider the following extreme scenario: you take two samples, one that is $1000$ liters, and one that is $0.00001$ liters (i.e., $0.01$ mL). The first sample has $42$ grams of particulates. The second sample has $0$ mg, because the sample volume is so small that your equipment cannot measure any detectable particulates.
Yet if you take the naive average of the two concentrations ($42$ mg per liter for the first sample and $0$ mg per liter for the second), you obtain an estimate of $21$ mg per liter. This plainly contradicts intuition. Although this is just a thought experiment, it shows that we must treat the volume of each sample as a weighting factor that determines how much its concentration contributes to the average, since a large sampled volume should in a sense be "more representative" of the true mean particulate concentration than a small one.
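As a quick numerical check of this thought experiment, here is a minimal Python sketch. The variable names are illustrative, and the values are the hypothetical ones from the scenario above, not real measurements.

```python
# Hypothetical values from the thought experiment (not real measurements).
volumes_l = [1000.0, 0.00001]   # sample volumes in litres
masses_mg = [42_000.0, 0.0]     # 42 g = 42,000 mg; nothing detected in the tiny sample

# Per-sample concentrations in mg/L.
concentrations = [m / v for m, v in zip(masses_mg, volumes_l)]

# Naive average: every sample counts equally, regardless of its volume.
naive_mean = sum(concentrations) / len(concentrations)

# Pooled average: total mass divided by total volume.
pooled_mean = sum(masses_mg) / sum(volumes_l)

print(f"naive mean:  {naive_mean:.2f} mg/L")
print(f"pooled mean: {pooled_mean:.2f} mg/L")
```

The naive mean is pulled halfway toward zero by a sample that contributes almost no volume, while the pooled mean stays essentially at the concentration of the large sample.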
That said, if all volumes sampled were identical, the two methods of computing the sample mean would be equivalent: if you captured, say, exactly $5$ L every time, it would make no difference which method you use. But it is precisely because some volumes were substantially larger, and others smaller, that a weighted average becomes necessary.
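To see this equivalence in symbols, write $w$ for the common volume and $x_i$ for the particulate mass in the $i^{\rm th}$ of $n$ samples (notation chosen here for illustration). Then the plain average of the concentrations is $$\frac{1}{n}\sum_{i=1}^n \frac{x_i}{w} = \frac{\sum_{i=1}^n x_i}{n w},$$ and since the total volume is $n w$, the right-hand side is exactly the total mass divided by the total volume.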
Algebraically, if we have $n$ samples and the $i^{\rm th}$ sample has volume $w_i$ and particulate mass $x_i$, then the sample mean concentration should be calculated as $$\bar x = \frac{\sum_{i=1}^n x_i}{\sum_{i=1}^n w_i}.$$
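Applied to the five samples in the question, this formula can be sketched in Python as follows. The function names `pooled_concentration` and `weighted_mean` are illustrative helpers, not part of any library. The sketch also checks that the pooled value equals the volume-weighted average of the individual concentrations $c_i = x_i / w_i$, since $\sum w_i c_i / \sum w_i = \sum x_i / \sum w_i$.

```python
def pooled_concentration(masses_mg, volumes_l):
    """Total mass divided by total volume (mg/L)."""
    return sum(masses_mg) / sum(volumes_l)


def weighted_mean(values, weights):
    """Weighted average of values using the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Data from the table in the question.
volumes_l = [4, 6, 5, 9, 4]          # litres
masses_mg = [253, 439, 205, 226, 30]  # milligrams

# Per-sample concentrations c_i = x_i / w_i.
concentrations = [m / v for m, v in zip(masses_mg, volumes_l)]

naive = sum(concentrations) / len(concentrations)  # unweighted mean
pooled = pooled_concentration(masses_mg, volumes_l)
weighted = weighted_mean(concentrations, volumes_l)  # weights = volumes

print(f"naive mean:           {naive:.1f} mg/L")
print(f"pooled mean:          {pooled:.1f} mg/L")
print(f"volume-weighted mean: {weighted:.1f} mg/L")
```

The pooled and volume-weighted values agree (about 41.2 mg/L), while the unweighted mean of the concentrations comes out at about 42.0 mg/L, matching the two figures in the question.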