Computation and interpretation involving rank statistics


Let $X_1, ..., X_n$ be independent random variables with continuous CDF $F$.

$R_1, ..., R_n$ denotes the corresponding rank statistics, i.e. $R_i$ is the rank of $X_i$ in the order statistics $X_{(1)} \le ... \le X_{(n)}$.

Define $\overline{R}:= \frac{1}{n} \sum_{i=1}^{n}R_i$ and $\overline{i}:= \frac{1}{n} \sum_{i=1}^{n}i$ and, using these, $$r:=\frac{\sum_{i=1}^{n}(R_i-\overline{R})(i-\overline{i})}{{\sqrt{\sum_{i=1}^{n}\left(R_{i}-\overline{R}\right)^{2}\sum_{i=1}^{n}\left(i-\overline{i}\right)^{2}}}}.$$

  1. Find a constant $c_n$ and functions $f_i$ such that $$r=1-c_n\sum_{i=1}^{n}f_i(R_i).$$

  2. Give an interpretation for the extrema of $r$.

  3. Compute $\mathbb{E}[r]$ and $Var(r)$.

Now suppose there's another sample $Y_1, ..., Y_m$, and $R_i$, $i=1, ..., n$, now denotes the rank of $X_i$ in the order statistics of the pooled sample $(X_1, ..., X_n, Y_1, ..., Y_m)$.

  1. Compute $\mathbb{E}[\sum_{i=1}^{n}R_i]$ and $Var(\sum_{i=1}^{n}R_i)$.

  2. Give an interpretation for the extrema of $\sum_{i=1}^{n}R_i$.

As far as I remember, we didn't treat rank statistics so far. I recall a little something about order statistics and this is supposed to be a repetition. But my trouble starts with 1. and I basically can't make any sense of the whole concept, e.g. the definition of $r$.

Can anyone explain some of this stuff to me so I might be able to tackle the problems?

[edit] Sorry, there was another fault in the formula for $r$.

[edit2] I keep trying without success. One thing that puzzles me, for example: what's the difference between $\overline{R}$ and $\overline{i}$ anyway? Every index should appear exactly once in both sums, and the ordering doesn't change the sum, so they should be the same?


There are 4 best solutions below

0
On BEST ANSWER

For question 2: the maximum of $r$ is $1$, attained when all of Spearman's $d_i$ are zero, i.e. when the $X_i$ are already in ascending numerical order in the order in which they were drawn. This is sensible, because then $R_i = i$ for every $i$, so there is perfect positive correlation between position $i$ and rank $R_i$. The minimum is $r=-1$, attained when the $X_i$ are in reverse (descending) order (perfect negative correlation of $i$ with $R_i$).
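The two extreme cases can be checked numerically. The following is only a minimal sketch: the helper names `ranks` and `spearman_r` and the sample values are made up for illustration, ties are assumed not to occur (as with a continuous $F$), and $r$ is computed from the Spearman form $r = 1 - \frac{6}{n(n^2-1)}\sum_i (i-R_i)^2$.

```python
def ranks(xs):
    """Rank of each xs[i] within the sample (1 = smallest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0] * len(xs)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_r(xs):
    """Correlation between the position i and the rank R_i (Spearman form)."""
    n = len(xs)
    # d_i = i - R_i, with positions counted from 1
    d2 = sum((i - r_i) ** 2 for i, r_i in enumerate(ranks(xs), start=1))
    return 1 - 6 * d2 / (n * (n * n - 1))


increasing = [0.1, 0.4, 0.7, 1.3, 2.0]  # already sorted: R_i = i
decreasing = increasing[::-1]           # reversed: R_i = n + 1 - i

print(spearman_r(increasing))  # 1.0
print(spearman_r(decreasing))  # -1.0
```

Any sample that is neither fully increasing nor fully decreasing gives a value strictly between these two.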

1
On

If there were a second summation under the square root, in front of the $(i-\overline{i})^2$, then $r$ would be a sample correlation coefficient. Are you sure you have not forgotten such a summation?

0
On

I agree that the question is not sufficiently explanatory in its definitions. But I suspect that this relates to the Spearman rank correlation coefficient between two data sets $X_i$, $Y_i$, for the case where one of the data sets is not random but is simply the sequence of indices $i$ of the observations, i.e., in Spearman's notation, $X_i=i$ (order, deterministic) and $Y_i=R_i$ (rank of the $i$-th observation, random). The Spearman rank correlation coefficient is the Pearson product-moment correlation of the ranked data and can be written (see the literature) as $$r=1-\frac{6\sum^n_{i=1}d^2_i}{n(n^2-1)}$$ where $d_i=X_i-Y_i$, which here is $d_i = i - R_i$. Hence in your case, question 1 gives $$c_n=\frac{6}{n(n^2-1)},~~~f_i(R_i)=(i-R_i)^2.$$ However, I don't know how the Spearman result is derived theoretically to arrive at this expression; you may want to search for this online or in a book on nonparametric (distribution-free) statistics.
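For what it's worth, here is a sketch of one way to get from the definition of $r$ to that expression. It assumes no ties (probability one for continuous $F$), so that $(R_1,\dots,R_n)$ is a permutation of $(1,\dots,n)$; the symbol $S$ is introduced here only as shorthand. Because both sums run over the same numbers, $\overline{R}=\overline{i}=\frac{n+1}{2}$ and $$S:=\sum_{i=1}^{n}(i-\overline{i})^2=\sum_{i=1}^{n}(R_i-\overline{R})^2=\frac{n(n^2-1)}{12},$$ so the denominator of $r$ is just $S$. Expanding the squared differences around the common mean gives $$\sum_{i=1}^{n}(i-R_i)^2=\sum_{i=1}^{n}\left((i-\overline{i})-(R_i-\overline{R})\right)^2=2S-2\sum_{i=1}^{n}(i-\overline{i})(R_i-\overline{R}),$$ and therefore $$r=\frac{\sum_{i=1}^{n}(i-\overline{i})(R_i-\overline{R})}{S}=1-\frac{1}{2S}\sum_{i=1}^{n}(i-R_i)^2=1-\frac{6}{n(n^2-1)}\sum_{i=1}^{n}(i-R_i)^2.$$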

0
On

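For question 3: since the $X_i$ are independent with the same continuous CDF $F$, the rank vector $(R_1,\dots,R_n)$ is a uniformly distributed permutation of $(1,\dots,n)$. Each $R_i$ is therefore uniform on $\{1,\dots,n\}$, and the classical formulas for sums of powers give $E[R_i]=\frac{n+1}{2}$ and $E[R_i^2]=\frac{(n+1)(2n+1)}{6}$. Then $$E[r] = 1 - c_n \sum^n_{i=1} E[f_i(R_i)] = 1-\frac{6}{n(n^2-1)} \sum^n_{i=1} \left(i^2 - 2i\,E[R_i] + E[R^2_i]\right) = 1-\frac{6}{n(n^2-1)}\cdot\frac{n(n^2-1)}{6} = 0.$$ For the variance, $Var(R_i)= E[R^2_i]-(E[R_i])^2=\frac{n^2-1}{12}$ and, because $\sum_i R_i$ is constant, $Cov(R_i,R_j)=-\frac{n+1}{12}$ for $i\neq j$. Since $\sum_i i^2$ and $\sum_i R_i^2$ are constants, $Var(r)=4c_n^2\,Var\left(\sum^n_{i=1} i\,R_i\right)$, which works out to $Var(r)=\frac{1}{n-1}$.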
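The moments of $r$ can be sanity-checked by brute force for small $n$. The sketch below assumes that every permutation of $(1,\dots,n)$ is an equally likely rank vector (which is the case for i.i.d. observations from a continuous $F$) and enumerates all $n!$ of them with exact rational arithmetic; the function name `exact_moments` is made up for this illustration. Under that assumption the enumeration should give mean $0$ and variance $\frac{1}{n-1}$.

```python
from fractions import Fraction
from itertools import permutations


def exact_moments(n):
    """Exact mean and variance of r over all n! equally likely rank vectors.

    Requires n >= 2, since c_n = 6 / (n (n^2 - 1)) is undefined for n = 1.
    """
    c_n = Fraction(6, n * (n * n - 1))
    values = [
        # r = 1 - c_n * sum_i (i - R_i)^2, positions counted from 1
        1 - c_n * sum((i - r_i) ** 2 for i, r_i in enumerate(perm, start=1))
        for perm in permutations(range(1, n + 1))
    ]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


for n in range(2, 7):
    mean, var = exact_moments(n)
    print(n, mean, var)
```

Enumeration is only practical for small $n$ because the number of permutations grows as $n!$; for larger $n$ one would simulate random permutations instead.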