Finding weighted standard deviation

55 Views Asked by At

In short

I have 2 columns of data; quantity and time taken for that quantity of items to get from point A to B.

e.g.

Quantity Travel time (Days)
725 10
583 12
609 11
23 2
447 13
14 28

I want to find a weighted standard deviation for this data, based on the quantities.

Intuition

The rows which have large quantities are pretty consistent, while those with small quantities vary more wildly. Therefore I want to use the low quantities to negate the effect of their respective travel time.

My approach

Let $Q_i$ and $T_i$ represent the quantity and travel time of row $i$, respectively, and assume I have $n$ data points.

Let

$$Q = \sum_{i=1}^{n}Q_i.$$

Then I calculate a weighted average

$$E(T) = \sum_{i=1}^{n}\frac{Q_i}{Q}\cdot T_i.$$

A naive calculation for standard deviation would be

$$\sigma_T = \sqrt{\sum_{i=1}^{n}\frac{Q_i}{Q}\big(T_i - E(T)\big)^{2}}$$

Applying these calculations, once to the entire dataset and a second time to the dataset with the anomalous values removed, I get the following:

One (probably bad) approach of calculating weighted standard deviation on two data sets.

As you can see, the weighted mean is quite stable, whereas the weighted standard deviation varies wildly; from $1.91$ to $1.10$.

I cannot find how to attach the .xlsx file, but if needed, I can make a download link.

I need to manipulate my calculation such that the values in F5 and F7 on the first screenshot are decreased.

One approach could be to utilize something like $E(T)^{(1-w)}$ or $E(T)^{(1-w)^{\frac{1}{2}}}$ ($w$ being the weight of a datapoint); for weights close to $0$, it would be close to the weighted average (thus decreasing their impact on the standard deviation) and for weights close to $1$, this values would also be close to $1$ (thus meaning it would have a small effect).

The problem is that any manipulation that I apply takes away from this value being the true standard deviation - i.e. I wouldn't be able to confidently say that 68% of my data falls in the range bound by the standard deviation.

Any help is greatly appreciated!