This is my first ever math question, so I'll try my best to make it useful for others. Here's the problem: I want to measure site load speed and decide whether it is within a satisfactory range, or, roughly speaking, good enough compared to what we have measured before.
Here's the method I use:
- Measure site load speeds, say 3 times, for a `control group`.
- Get values like `14022, 11505, 11795` - these are in milliseconds. Yes, this is a pretty slow website with a lot of resources, so the full page load takes a while.
- Detect the "satisfactory range" by computing the `standard deviation` for a sample of these 3 values: `stdev(14022, 11505, 11795) = 1377`. I use the `standard deviation of a sample` formula, which gives a slightly higher value.
By this point I understand the "history", i.e. how fast the website is expected to load based on previous observations, and the possible deviation, i.e. ± (1377 / 2) = ± 688.
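To show what I mean by the "slightly higher value": the difference between the two formulas can be checked with Python's built-in `statistics` module (a minimal sketch using the values above):

```python
import statistics

samples = [14022, 11505, 11795]  # control group load times in ms

# The population formula divides by n; the sample formula divides by n - 1,
# which is why the sample standard deviation comes out slightly higher.
print(statistics.pstdev(samples))  # ~1124.4
print(statistics.stdev(samples))   # ~1377.1
```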
What I will try to do next is to measure load speeds again after some update and check whether anything has changed, taking into account the possible deviation and whether the new values fall within the range.
- Perform 3 new measurements for a `treatment group`.
- Get, say, values: `10494, 10197, 10612`.
- Take the `median` value = `10494`.
- Calculate its difference from the `mean average` of the `control group`: `10494 - mean(14022, 11505, 11795) = 10494 - 12441 = -1947`.
- Compare the result to the standard deviation of the `control group` sample: `abs(-1947) > 1377.1`. Perhaps I'm wrong here and I have to compare it to the value I called "possible deviation" previously: `688`.
- The result I got suggests that the change has `statistical significance`, because it differs from the average by more than a standard deviation (or half of it).
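For reference, the whole calculation in the steps above can be reproduced with Python's `statistics` module (this is a sketch of the procedure as described, not an endorsement of it):

```python
import statistics

control = [14022, 11505, 11795]    # milliseconds
treatment = [10494, 10197, 10612]

sd = statistics.stdev(control)         # sample standard deviation, ~1377.1
med = statistics.median(treatment)     # 10494
diff = med - statistics.mean(control)  # 10494 - 12440.67 = ~-1946.67

# The comparison as described: is the shift larger than one standard deviation?
print(abs(diff) > sd)  # True
```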
So the conclusion I draw from the calculation above is: the changes that have been made to the website have improved its load time by ~2 seconds on average, and the difference is statistically significant.
However, I'm not sure if the reasoning I'm using is correct. Please correct me so I understand where I'm mistaken and perhaps answer the following questions:
- Do I understand `standard deviation` correctly, as a measure of the possible difference from the "norm" that occurs due to imperfections in the measurement method?
- If I do use it correctly, shouldn't I call the value of `1 standard deviation` a `sigma` and only consider significant the changes that exceed `3 sigma` in magnitude?
- Is it correct to calculate the average of the sample to compare against? I do so because I want to account for possible outliers that have happened historically.
- Is it correct to compare the median value of the treatment group, because I want to compare a real observed value instead of an imaginary average?
Thank you so much if you have made it this far, and I'm looking forward to your answers!
The usual approach would be this: collect at least 10 values for the control group and 10 for the treatment group. Then you run a two-sample t-test (technically only valid if the values are normally distributed, or if you have a lot more observations) or a Wilcoxon rank-sum test (also known as the Mann-Whitney U test) to see whether there is a significant difference between control and treatment.
Right now you are collecting too few observations - you could not attain statistical significance at a reasonable level with a Wilcoxon test even if treatment always has lower values than control.
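To see why, here is a sketch of the exact permutation distribution behind the rank-sum test for two groups of 3, using your measurements (the helper name `rank_sum` is made up for illustration):

```python
from itertools import combinations

control = [14022, 11505, 11795]
treatment = [10494, 10197, 10612]
values = control + treatment

def rank_sum(group, all_vals):
    # Rank all 6 measurements from smallest (1) to largest (6)
    # and sum the ranks that fall in the given group.
    ranks = {v: i + 1 for i, v in enumerate(sorted(all_vals))}
    return sum(ranks[v] for v in group)

observed = rank_sum(control, values)  # 15: control holds the 3 largest values

# Under H0, every way of labelling 3 of the 6 values as "control"
# is equally likely, giving C(6,3) = 20 equally probable rank sums.
sums = [rank_sum(list(c), values) for c in combinations(values, 3)]
mean_sum = sum(sums) / len(sums)  # 10.5

# Two-sided p-value: fraction of labellings at least as extreme as observed.
p = sum(abs(s - mean_sum) >= abs(observed - mean_sum) for s in sums) / len(sums)
print(p)  # 0.1 -> even perfect separation cannot reach p < 0.05 with n = 3
```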
A perhaps simpler alternative is to compute confidence intervals for both of your groups. These are typically based on an estimate of the standard deviation/standard error, and somewhat resemble what you do above. Confidence intervals are computed for a specific confidence level, typically 95%. If the two confidence intervals do not overlap, then there is a significant difference between the two, and you can view the treatment as successful.
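As a sketch of that approach with your n = 3 samples (the t critical value is taken from a standard table, and `t_ci` is a made-up helper name):

```python
import math
import statistics

def t_ci(sample, t_crit):
    # 95% CI for the mean: mean +/- t * s / sqrt(n), where t_crit is the
    # 0.975 quantile of the t distribution with n - 1 degrees of freedom.
    m = statistics.mean(sample)
    half = t_crit * statistics.stdev(sample) / math.sqrt(len(sample))
    return (m - half, m + half)

# Two-sided 95% critical value for df = 2 (n = 3) is 4.303.
control_ci = t_ci([14022, 11505, 11795], t_crit=4.303)
treatment_ci = t_ci([10494, 10197, 10612], t_crit=4.303)
print(control_ci, treatment_ci)
# The intervals overlap, so by the overlap criterion no significant
# difference can be claimed with only 3 measurements per group.
```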
Overall, these statistical tests are exactly what you need: filter out the measurement noise to figure out whether the treatment really worked. But really, increase the sample size; a measurement takes only a few seconds.