Understanding the notion of percentiles

128 Views Asked by At

I want to preface this by saying I am a beginner in probability.

I recently began reading the book "Algorithms to Live By", and so have been thinking about the optimal stopping problems for a while, specifically the Secretary Problem, and how this problem would change if information was provided regarding the distribution of the data points. I hence tried to understand what percentiles really tell you about relative standing.

Here is my operational definition:

Given a data set of size n, the percentile of a data point $X$ is value $ \pi(X) = \frac{|\{Y | S(Y)\leq S(X)\}|}{n}$ where $S:D\rightarrow \mathbb{R}$ gives the score of a data point.

So, for example, if $n=100$ and $X$ is in the $95'th$ percentile, $.95 = \pi(X) = \frac{95}{100}$ as expected.

Now here's a question:

Suppose I have a data set of size $n$ and I pick a data point $X$ at random. Define $P_{y}(X)$ to be the probability that for this $X$ we have that $\pi({X}) > y$ for a given $y$. What, then, is a closed form for $P_{y}(X)$ in terms of $n$ and $y$?

Put more informally, given a random data point, what is the chance of it being in the $p'th$ or higher percentile for a given $p$.

My informal attempt:

For any given $y$, we have that $P(\pi(X) > y)$ is the same as $P( \frac{|\{Y | S(Y)\leq S(X)\}|}{n} > y)$. Obviously, we need that $y\leq1$.

Going back to the earlier example, with $n=100$, intuition tells us that $P(\text{ $X$ is in the $95'th$ percentile})$ should equal $.5$ percent. Since $95$ percent of data points $Y$ are s.t. $S(Y) < S(X)$.

We furthermore intuit that $P(\pi(X)>y)$ should increase if $y$ is small, but decrease if $y$ is large -- since more data fall above the $20'th$ percentile than the $90'th$. So we guess a solution: $P(\pi(X)>y) \approx 1 - \pi(X)$. This intuitively makes sense since $\pi(X)$ of the data has score less than $X$, and so for a given score $y$, we have that $1-\pi(X)$ of the data has score more than it.

Problem

How to formalize these notions? So far I have carried on informally and intuitively. How are these notions standardly defined in probability theory, and how my definitions and intuitions compare?

1

There are 1 best solutions below

11
On BEST ANSWER

Note: Edited based on clarifications in the comments.


Let's say we have our dataset $D:=[X_1, X_2,...,X_n]\;,\; X_i \sim F_X$

We can translate this to the "scored" dataset $Q:=S(D) = [S(X_1), S(X_2),...,S(X_n)] := [S_1,S_2,...,S_n] $

We will also use "order statistic" notation to express the ranking of these scores in ascending order: $S_{(i)}$ means the $i-$th largest score in $Q$. We will also define $S_{(0)}= -\infty$ to simplify notation.

We can now define our sample percentile function (also called the empirical cumulative distribution function):

$$\pi_n(X) := \frac{|\{y:S(y) \leq S(X),\;y \in D\}|}{n}=\frac{|\{s:s \leq S(X),\;s \in Q\}|}{n}$$

Now, we will define $P_y(X;n)$ similarly to how you have (but explicitly mention $n$):

$$P_y(X;n) := P(\pi_n(X)>y),\;\;X\sim F_X,\;y \in [0,1]$$

The $n$ points in $Q$ define the locations where the value of $\pi_n$ changes:

$$\text{ Let }s\in\mathbb{R},\;\; i \in \{0,...,n-1\}:\;\;S_{(i)}\leq s < S_{(i+1)} \implies \pi_n(s) = \frac{i}{n}$$

So our $\pi_n$ function has "jumps" at each data point.

This means that

$$P_y(X;n) = P(S(X)\geq S_{(i)}) \text{ where $i$ is the smallest integer such that } \frac{i}{n} > y$$

In general, for a given sample, the sample percentile function $\pi_n(x)$ will not match the "true" percentile function $\pi(x):= P(S(X)\leq x)=F_X(x)$. The study of the behavior of order statistics is a major field of study in statistics and the results are not simple.

However, the limiting behavior is much nicer -- specifically by invoking the Glivenko-Cantelli theorem we know that as the sample size gets larger, $\pi_n \to \pi$ as follows:

$$ \sup_{x \in \mathbb{R}} \left|\pi_n(x)-\pi(x)\right| \xrightarrow{a.s.} 0$$

In other words, the sample percentile function approaches the true percentile function uniformly.

So, your intuition was correct that $P_y(X;n)\approx 1-\pi_n(y)$. $P_y(X;n)$ is maximized at $y=0$ since $S(X)$ will have to be equal to or less than the smallest observed score in the dataset.