Suppose we have an ordered set of data $\mathcal X=\{x^{(1)},x^{(2)},\ldots,x^{(n)}\}$ such that $x^{(1)}\le x^{(2)}\le \ldots\le x^{(n)}$. In my course's definition of the empirical quantile, the probability of getting a value less than $x^{(i)}$ is written as:
$$ p_i=\frac{i}{n+1} $$
I don't understand why we divide by $n+1$ instead of $n$, because it means that the probability $p_n$ of getting a value less than the highest value in our data set is less than 1 (i.e. $p_n=\frac{n}{n+1}<1$). But clearly we should have $p_n=1$? So how come this definition is used for the empirical quantile?
Thanks a lot!
I'm not sure what you mean by "... getting a value less than ..." but the expression you show for $p_i$ arises in the context of order statistics as follows.
If the underlying random variable has a uniform distribution on $(0,1)$, then $p_i = i/(n+1)$ is the expected value of the $i$th order statistic.
The event that the $i$th order statistic is at most $x$ is the union, over $j = i, \ldots, n$, of the mutually exclusive events that exactly $j$ of the samples are at most $x$ and the remaining $n-j$ samples are greater than $x$. For each $j$ there are $\binom{n}{j}$ ways to choose which samples fall at or below $x$, so the probability of the union is a sum of binomial probabilities over $j=i, \ldots,n$.
Assuming an iid sample of size $n$ with common distribution function $F$, the distribution function for $X_{(i)}$ is
$$\mathbb{P} (X_{(i)}\leq x) = \sum_{j=i}^{n} {{n}\choose{j}}[F(x)]^j[1-F(x)]^{n-j}$$
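As a rough sanity check of this formula, here is a minimal sketch that compares the binomial sum with a simulated frequency for $U(0,1)$ samples, where $F(x) = x$. The function names, the choice $i=2$, $n=5$, $x=0.3$, the number of trials, and the seed are all illustrative assumptions, not part of the derivation.

```python
import math
import random

def order_stat_cdf(i, n, p):
    """Binomial-sum formula for P(X_(i) <= x), where p = F(x)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(i, n + 1))

def simulated_cdf(i, n, x, trials=20000, seed=0):
    """Fraction of U(0,1) samples of size n whose i-th smallest value is <= x."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    hits = 0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        if sample[i - 1] <= x:  # i is 1-based, list indices are 0-based
            hits += 1
    return hits / trials

exact = order_stat_cdf(2, 5, 0.3)   # F(x) = x for U(0,1)
approx = simulated_cdf(2, 5, 0.3)
print(f"formula: {exact:.4f}, simulation: {approx:.4f}")
```

The two numbers should agree up to Monte Carlo noise, which shrinks as `trials` grows.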
To get the density function, take the derivative of the distribution function with respect to $x$ (writing $f = F'$ for the common density) to obtain
$$f_{(i)}(x) = n{n-1 \choose i-1}F(x)^{i-1}(1-F(x))^{n-i}f(x).$$
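Assuming a $U(0,1)$ distribution, we have $f(x) = 1$ and $F(x) = x$ on $[0,1]$. Integrating the product $xf_{(i)}(x)$ over $[0,1]$ and using the Beta integral $\int_0^1 x^{a}(1-x)^{b}\,dx = \frac{a!\,b!}{(a+b+1)!}$, we find the expected value
$$E(X_{(i)})=n{n-1 \choose i-1}\int_0^1 x^{i}(1-x)^{n-i}\,dx = n{n-1 \choose i-1}\frac{i!\,(n-i)!}{(n+1)!}=\frac{i}{n+1}.$$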
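The value $i/(n+1)$ can also be checked numerically. The following is a minimal sketch, assuming $U(0,1)$ samples; the helper name `mean_order_stats`, the sample size $n=4$, the number of trials, and the seed are illustrative choices of mine.

```python
import random

def mean_order_stats(n, trials=20000, seed=0):
    """Average each order statistic over many U(0,1) samples of size n."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    totals = [0.0] * n
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        for k, value in enumerate(sample):
            totals[k] += value
    return [t / trials for t in totals]

n = 4
means = mean_order_stats(n)
expected = [i / (n + 1) for i in range(1, n + 1)]
for i, (m, e) in enumerate(zip(means, expected), start=1):
    print(f"i={i}: simulated mean {m:.3f}, i/(n+1) = {e:.3f}")
```

In particular, the simulated mean of the largest value should come out close to $n/(n+1)$ rather than $1$: on average the sample maximum falls short of the upper end of the distribution, which is consistent with $p_n < 1$.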