Estimating a percentile based on running totals


I am looking for a way to estimate percentiles of potentially highly skewed data when I do not have access to all of the data at once. We can assume that, although the data may not follow a normal bell curve, it is still unimodal. All I have is information that I tallied as I encountered each data point: the minimum and maximum values seen so far, the number of values, the sum of the values, the sum of their squares, the sum of their cubes (for computing the skewness), and the sum of their 4th powers (for computing the kurtosis). In theory I could track other information as each data point arrives, but for now this is what I have.
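To make the setup concrete, here is a minimal sketch of how those tallies could be turned into a mean, standard deviation, skewness and kurtosis. The class name `RunningMoments` and its fields are hypothetical, chosen only for illustration, and the formulas are the population (divide by n) versions. Raw power sums can lose precision when the mean is large relative to the spread, so treat this as a sketch rather than a robust implementation.

```python
import math


class RunningMoments:
    """Accumulates the tallies from the question, one value at a time (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.s1 = self.s2 = self.s3 = self.s4 = 0.0
        self.lo = math.inf
        self.hi = -math.inf

    def add(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x ** 2
        self.s3 += x ** 3
        self.s4 += x ** 4
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def summary(self):
        """Return (mean, std, skewness, kurtosis) using population formulas."""
        n = self.n
        m1, m2, m3, m4 = self.s1 / n, self.s2 / n, self.s3 / n, self.s4 / n
        # Central moments expressed through the raw moments.
        var = m2 - m1 ** 2
        mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
        mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
        std = math.sqrt(var)
        # Kurtosis here is the non-excess value (3 for a normal distribution).
        return m1, std, mu3 / std ** 3, mu4 / var ** 2


acc = RunningMoments()
for value in [1, 2, 3, 4, 10]:
    acc.add(value)
mean, std, skew, kurt = acc.summary()
```

These four numbers, together with the running minimum and maximum, are the inputs a moment-based percentile estimate has to work with.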

Obviously, a very cheap estimate would be to assume that the distribution is normal, in which case it is simple to calculate the value that corresponds to a given percentile. But this estimate accounts neither for how skewed the data might be nor for how narrow its modal peak may be.
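For reference, the cheap normal-based estimate could look roughly like this. The function name `normal_percentile` and its parameters are hypothetical; it uses only the count, the sum and the sum of squares, which is exactly why it cannot reflect skewness or peakedness.

```python
import math
from statistics import NormalDist


def normal_percentile(n, total, total_sq, p):
    """Value at percentile p (0 < p < 1), assuming the data are normal.

    n, total and total_sq are the running count, sum and sum of squares.
    """
    mean = total / n
    var = total_sq / n - mean ** 2  # population variance from the tallies
    return NormalDist(mean, math.sqrt(var)).inv_cdf(p)


# Tallies for the hypothetical sample [1, 2, 3, 4, 10]: n=5, sum=20, sum of squares=130.
median_estimate = normal_percentile(5, 20.0, 130.0, 0.5)
upper_estimate = normal_percentile(5, 20.0, 130.0, 0.975)
```

For this sample the estimated median equals the mean, even though the true median is lower, which illustrates the weakness described above.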

So, given only the above information, is it possible to estimate the value at a given percentile in a way that properly accounts for how the data actually seems to be distributed, based on the above tallies?


There is 1 solution below


One solution is to compute the inverse cumulative distribution function of a Pearson system (or, alternatively, a Johnson system) fitted to your estimates of the mean, standard deviation, skewness and kurtosis.

In MATLAB you can compute the pdf with the pearspdf function (https://fr.mathworks.com/matlabcentral/fileexchange/26516-pearspdf). You could then cumsum it over a grid (scaling by the grid spacing) to get the cdf, which is increasing and so can easily be inverted.
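The cumulate-then-invert step does not depend on MATLAB. Below is a minimal Python sketch of the same idea, assuming you already have some density function to evaluate on a grid. The name `quantile_from_pdf` is hypothetical, and the exponential density is only a skewed stand-in for whatever a Pearson pdf routine would return. In practice the running minimum and maximum could serve as the grid limits.

```python
import math
from bisect import bisect_left


def quantile_from_pdf(pdf, lo, hi, p, steps=10_000):
    """Approximate the p-quantile of a density evaluated on the grid [lo, hi]."""
    dx = (hi - lo) / steps
    xs = [lo + i * dx for i in range(steps + 1)]
    ys = [pdf(x) for x in xs]

    # Cumulative trapezoidal sum plays the role of cumsum(pdf) * dx.
    cdf = [0.0]
    for i in range(1, steps + 1):
        cdf.append(cdf[-1] + 0.5 * (ys[i - 1] + ys[i]) * dx)

    # Renormalise so the cdf ends at exactly 1 despite truncation of the grid.
    total = cdf[-1]
    cdf = [c / total for c in cdf]

    # The cdf is non-decreasing, so a binary search finds the bracketing cell.
    i = bisect_left(cdf, p)
    if i == 0:
        return xs[0]
    if i > steps:
        return xs[-1]
    c0, c1 = cdf[i - 1], cdf[i]
    if c1 == c0:
        return xs[i]
    # Linear interpolation inside the bracketing cell.
    return xs[i - 1] + (p - c0) / (c1 - c0) * dx


# Stand-in skewed density: exponential with rate 1, truncated at x = 50.
median = quantile_from_pdf(lambda x: math.exp(-x), 0.0, 50.0, 0.5)
```

The accuracy depends on the grid spacing and on how much probability mass lies outside the chosen limits, so a finer grid or wider limits may be needed for heavy tails.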

Edit: Alternatively, you can use the qpearson function from the R package PearsonDS (https://cran.r-project.org/web/packages/PearsonDS/PearsonDS.pdf), which would directly answer your problem.