Why add one to the number of observations when calculating percentiles?


The CFA Quantitative Methods book uses the following formula for the position, in a sorted list of $n$ observations, of the observation that corresponds to a given percentile $y$:

$(n + 1)\frac{y}{100}$

It defines percentile as follows: "Given a set of observations, the yth percentile is the value at or below which y percent of observations lie."

My question is, where does the $+ 1$ come from? I can see that if you wanted to ensure that all values are below a given percentile, it is useful. It also ensures the correct value for the median. But given the definition of percentile above, I would think it should be possible to have a hundredth percentile, which would be equal to the largest value. Is the "at or below" in conflict with the $+ 1$?


There are 2 best solutions below

0
On BEST ANSWER

You answered your own question:

"I can see that if I use the first formula to calculate the 50th percentile, the +1 ensures I get the same answer as when I calculate the median."

That's a really important property for percentiles! One you should want.

Also, if you are fitting empirical data to some parametric curve, adding +1 allows for a "tail". Many of the curves you would fit have infinite support, so without the +1 you would be saying that your largest data point sits at the 100th percentile, which is usually a bad assumption.
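Both points can be checked numerically. Here is a minimal Python sketch; the helper names `percentile_position` and `percentile_value` are made up for illustration, and linearly interpolating between neighbouring observations when the position is fractional is an assumption, not something stated above.

```python
from fractions import Fraction
from math import floor
from statistics import median


def percentile_position(n, y):
    """1-based position (n + 1) * y / 100 in a sorted list of n observations."""
    return Fraction((n + 1) * y, 100)


def percentile_value(sorted_data, y):
    """Value at the (n + 1) * y / 100 position of an already sorted list.

    Assumption: fractional positions are linearly interpolated between the
    two neighbouring observations. Positions outside 1..n raise ValueError.
    """
    n = len(sorted_data)
    pos = percentile_position(n, y)
    if pos < 1 or pos > n:
        raise ValueError(f"position {float(pos)} is outside 1..{n}")
    lower = floor(pos)   # 1-based index of the observation at or just below pos
    frac = pos - lower   # how far pos lies towards the next observation
    value = Fraction(sorted_data[lower - 1])
    if frac:
        value += frac * (sorted_data[lower] - sorted_data[lower - 1])
    return float(value)


odd = [1, 3, 5, 7, 9]
even = [1, 3, 5, 7]
print(percentile_value(odd, 50), median(odd))    # 50th percentile vs. median
print(percentile_value(even, 50), median(even))
print(percentile_position(len(odd), 100))        # position asked for at y = 100
```

For both lists the 50th percentile agrees with the median. At $y = 100$ the formula asks for position 6 in a list of five points, so the largest observation sits strictly below the 100th percentile, which is the "tail".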

6
On

These $+1$ terms show up a lot in counting problems because subtraction isn't quite the opposite of counting. This is hard to see symbolically, so it is best to use an example.

Suppose $99$ people took a test. Who is at the $99^\text{th}$ percentile? Well, precisely nobody comes above the best score, and only the best scorer comes above the second-best. But one person is a bit over $1$% of the group, so everyone but the top scorer makes up only the bottom $98.99$%, which falls just short of $99$%. So you want the $99^\text{th}$ percentile to be at the last (best) person. This is what the formula gives: $(99+1)\frac{99}{100}=99$.

But $\frac{ny}{100}=98.01$, once rounded up to the next whole position, would also get this result, so what gives?

Well, who is in the $100^\text{th}$ percentile? Nobody, by definition. Every single person should be below this mark, which means it cannot begin at any person. This is what the $+1$ formula gives, because $\frac{(n+1)(100)}{100}>n$. If you don't have the $+1$, then you get $\frac{n(100)}{100}=n$, which would mean the best scorer did better than all the people. That would be okay for "all the other people", but they can't have done better than themselves!
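To make the comparison concrete, here is a small Python sketch that tabulates both formulas for the $99$-person example. The function names are hypothetical, and exact fractions are used only to avoid floating-point noise.

```python
from fractions import Fraction


def position_with_plus_one(n, y):
    """Position (n + 1) * y / 100 in a sorted list of n observations."""
    return Fraction((n + 1) * y, 100)


def position_without_plus_one(n, y):
    """Position n * y / 100, i.e. the same formula with the +1 dropped."""
    return Fraction(n * y, 100)


n = 99
for y in (50, 99, 100):
    with_one = position_with_plus_one(n, y)
    without_one = position_without_plus_one(n, y)
    # A position greater than n means nobody in the list reaches that percentile.
    print(y, float(with_one), float(without_one), with_one > n)
```

At $y = 100$ the $+1$ version gives position $100$, which is past the last of the $99$ people, while the version without the $+1$ gives position $99$, the best scorer.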