Building a histogram of latency from a sample of the input data

Assume I have an input data set consisting of web page response times. I'd like to build a histogram from the input data, but for practical reasons I can only use a sample of it. Based on the histogram, I want to answer the question: what is the 99.99th percentile response time? The input data does not follow a normal distribution; it is most likely bi-modal or multi-modal.

The question is: how large a sample do I need? Can I estimate the error somehow? Is picking every X-th element of the input data a good sampling strategy?

Best answer

I would recommend calculating a non-parametric confidence interval for the 99.99th percentile. A good explanation is given here.
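
For reference, here is a sketch of the standard distribution-free result behind such an interval. It assumes a continuous response-time distribution and an independent random sample of size $n$. The symbols $l$, $u$ and $q_p$ are notation introduced here for illustration, not taken from the linked explanation. If $X_{(1)} \le \dots \le X_{(n)}$ are the sorted sample values and $q_p$ is the true $p$-quantile, then the number of sample values at or below $q_p$ is Binomial$(n, p)$, so

$$P\left(X_{(l)} \le q_p < X_{(u)}\right) = \sum_{k=l}^{u-1} \binom{n}{k} p^k (1-p)^{n-k}.$$

Choosing the ranks $l$ and $u$ so that this sum is at least the desired confidence level gives the interval $[X_{(l)}, X_{(u)}]$, whatever the shape of the distribution (bi-modal or otherwise).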

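For statistical accuracy, sample at random with replacement. Take as large a sample as possible $(n\gg 20)$, then use the normal approximation to the binomial mentioned in the link to find the order statistics in your sample that correspond, as closely as possible, to your desired confidence level. Note that for the 99.99th percentile only about $n(1-p) = n/10^4$ sample values are expected to lie above it, so in practice $n$ must be far larger than $10^4$ for the upper order statistic to exist and for the normal approximation to be reasonable.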
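
The procedure can be sketched in Python roughly as follows. This is a minimal illustration, not the linked explanation's exact recipe: the function names `quantile_ci_ranks` and `quantile_ci`, the conservative rounding of the ranks, and the synthetic bi-modal data are all assumptions made for the example. It uses only the standard library.

```python
import math
import random
from statistics import NormalDist


def quantile_ci_ranks(n, p, confidence=0.95):
    """Return 1-based ranks (lower, upper) of the order statistics that
    bracket the p-quantile, via the normal approximation to Binomial(n, p).

    Hypothetical helper: ranks are rounded outwards, so the interval is
    slightly conservative.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    mean = n * p
    half_width = z * math.sqrt(n * p * (1.0 - p))
    lower = math.floor(mean - half_width)
    upper = math.ceil(mean + half_width) + 1
    if lower < 1 or upper > n:
        # Too few observations in the tail to bracket the quantile.
        raise ValueError("sample too small for this quantile and confidence level")
    return lower, upper


def quantile_ci(sample, p, confidence=0.95):
    """Return (lower bound, point estimate, upper bound) for the p-quantile."""
    xs = sorted(sample)
    n = len(xs)
    lower, upper = quantile_ci_ranks(n, p, confidence)
    point = xs[min(n, math.ceil(n * p)) - 1]  # simple order-statistic estimate
    return xs[lower - 1], point, xs[upper - 1]


if __name__ == "__main__":
    random.seed(42)
    # Synthetic bi-modal "population" of response times in milliseconds
    # (made-up data, standing in for the real input set).
    population = [
        random.lognormvariate(3.0, 0.3) if random.random() < 0.9
        else random.lognormvariate(6.0, 0.5)
        for _ in range(300_000)
    ]
    # Random sampling with replacement, as recommended above.
    sample = random.choices(population, k=200_000)
    low, point, high = quantile_ci(sample, 0.9999)
    print(f"99.99th percentile ~ {point:.1f} ms, 95% CI [{low:.1f}, {high:.1f}] ms")
```

With a sample as small as $n = 20$ the helper raises `ValueError`, because no order statistic lies above the 99.99th percentile with the required confidence. The width of the returned interval is a direct, distribution-free estimate of the sampling error.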