$QQ$-plot: why do we choose the empirical distribution $F_n(x) = \frac{\#\{y \in S \mid y \le x\}}{n}$, where $S$ is the sample, for comparison with the normal distribution?
Let $S$ be our sample of size $n$. We form the empirical distribution $F_n$ as defined above, and then use a $QQ$-plot to compare $F_n$ to $N(0,1)$ to see whether there might be a linear relationship.
- Why do we choose $F_n$ as the empirical distribution for our sample?
- Could we get other results if we did not choose $F_n$ as the empirical distribution?
- For the fractile of $p \in (0,1)$ we choose the midpoint $x$ of the interval corresponding to $p$. Why do we choose the midpoint?
I would appreciate your help.
Suppose $X_1,X_2,\ldots$ is an i.i.d. sequence of $N(0,1)$-distributed random variables. If $$ F_n(x)=\frac{\#\{1\leq i\leq n\mid X_i\leq x\}}{n}=\frac1n \sum_{i=1}^n \mathbf{1}_{\{X_i\leq x\}} $$ denotes the empirical distribution function, then by the strong law of large numbers $$ F_n(x)\to \Phi(x) \quad\text{almost surely as}\;n\to\infty, $$ for every $x$, where $\Phi$ is the CDF of an $N(0,1)$-distribution.
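The convergence can be illustrated with a small simulation. The following is only a sketch: the helper names `ecdf` and `max_gap`, the seed, the grid from $-3$ to $3$, and the sample sizes are all choices made here for illustration, not part of the argument above.

```python
import random
from statistics import NormalDist


def ecdf(sample, x):
    """Empirical distribution function F_n(x): fraction of sample values <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def max_gap(sample, grid):
    """Largest |F_n(x) - Phi(x)| over a fixed grid of x values."""
    phi = NormalDist(0.0, 1.0).cdf
    return max(abs(ecdf(sample, x) - phi(x)) for x in grid)


rng = random.Random(0)                    # fixed seed (arbitrary, for reproducibility)
grid = [i / 10 for i in range(-30, 31)]   # x = -3.0, -2.9, ..., 3.0 (assumed grid)

for n in (100, 1000, 10000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, round(max_gap(sample, grid), 4))
```

The printed gap between $F_n$ and $\Phi$ should tend to shrink as $n$ grows, although for a single random sample it need not decrease at every step.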
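This means that if you have an i.i.d. sample following an $N(0,1)$-distribution and $n$ is large enough, then $F_n$ should be "close" to $\Phi$. One way to check this is to plot $F_n(x)$ against $\Phi(x)$ and see whether the points lie on the identity line. Strictly speaking that is a $PP$-plot; a $QQ$-plot makes the same comparison on the quantile scale, plotting the quantiles of $F_n$ against the corresponding quantiles of $\Phi$.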
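As a rough sketch of how the points of a $QQ$-plot can be computed, the code below pairs the $i$-th smallest observation with the $N(0,1)$ quantile at $p_i = (i - 0.5)/n$, the midpoint of the jump of $F_n$ from $(i-1)/n$ to $i/n$. This is one common convention and an assumption here; it is not necessarily the exact midpoint rule the question has in mind. The function names `qq_points` and `fit_line` and the demo parameters are hypothetical.

```python
import random
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical N(0,1) quantile, sample quantile) pairs."""
    std_normal = NormalDist(0.0, 1.0)
    xs = sorted(sample)
    n = len(xs)
    # Assumed convention: p_i = (i - 0.5) / n, which stays strictly inside
    # (0, 1), so the normal quantile is finite for every point.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(std_normal.inv_cdf(p), x) for p, x in zip(probs, xs)]


def fit_line(points):
    """Least-squares slope and intercept of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


rng = random.Random(0)                                # arbitrary fixed seed
sample = [rng.gauss(2.0, 3.0) for _ in range(2000)]   # demo data: N(2, 3^2)
slope, intercept = fit_line(qq_points(sample))
print(slope, intercept)
```

For a sample from $N(\mu,\sigma^2)$ the points should lie close to the line $y = \mu + \sigma x$, so the fitted slope and intercept give rough estimates of $\sigma$ and $\mu$; for an $N(0,1)$ sample this is the identity line.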