Probability that the nth highest of a set of k numbers selected from m numbers has rank greater than the observed rank within that superset

110 Views Asked by At

I'm trying to calculate the importance of a subset of genes within a ranked set of genes for a process known as gene-set enrichment analysis (GSEA). GSEA enrichment scores are usually calculated by weighting ranked genes by their observed difference between groups, but I'd like to do it non-parametrically by weighting based on their rank alone.

For this weighting, I'd like to determine the probability that each selected gene would be present at a greater rank than the observed rank. In other words...

Consider a set $K$ of $k$ genes chosen from a set $M$ of $m$ genes, where all genes in $M$ have a distinct rank, $1 ... m$. Order the observed ranks of the $k$ genes numerically from lowest to highest, i.e. $r_1 < r_2 < ... < r_k $. What is the probability that the $n^{th}$-ranked gene from a random choice of $k$ genes, $R_n$, would be greater than the observed rank $r_n$, assuming all sets of $k$ genes chosen from $M$ are equally likely ?

These numbers are large enough that I have trouble simulating probabilities; for example, I might be looking at a set of 60 genes chosen from a total set of 20,000 genes.

I think I've worked out the probability for the top-ranked gene. For one gene selected from $m$ genes, $p(R_1 > r_1) = \frac{m-r_1}{m} = 1 - \frac{r_1}{m}$.

To try to explain that a bit more, there are $m$ ways of choosing one gene from $m$ genes. If that chosen gene (which is both the top and bottom-ranked gene from the selected set) happens to be the top-ranked gene from $M$ (i.e. $r_1 = r_k = 1$), then there are $m-1$ other choices of genes with a greater rank: $2, 3, 4, ..., m$. Assuming these other selections are equally likely, the probability of choosing one of those other greater-ranked genes from $M$ is $\frac{m-1}{m}$. Similarly, if that selected gene is the second-ranked gene from $M$, then the probability of choosing another greater-ranked gene from $M$ is $\frac{m-2}{m}$, and so on, leading to the general formula $\frac{m-r_1}{m}$.

Doing some pen and paper figures of triangle numbers, it looks like for $k$ genes selected from $m$ genes, $p(R_1 \gt r_1) = \frac{(m-r_1)!(m-k)!}{m!(m-r_1-k)!}$; I can calculate this exactly in R using the log factorial function.

But I can't get my head around the generalisation to other ranks. Is it simply

$p(R_n \gt r_n) = \frac{(m-r_n)!(m-k)!}{m!(m-r_n-k)!}$

or is there something else involved that makes the solution a lot more complex?

1

There are 1 best solutions below

0
On BEST ANSWER

I hope I’ve now understood what you want to calculate (though I still don’t quote understand what you mean by “the observed rank”). As far as I understand, you have $m$ genes ranked $1$ through $m$, you uniformly randomly select $k$ of them and order their ranks (among the $m$) as $R_1$ through $R_k$ and then you want to know $P(R_n\gt r_n)$.

It’s easier to calculate $P(R_n=r_n)$, and then you can get $P(R_n\gt r_n)$ by summation. For $R_n=r_n$, we need to select $n-1$ genes from the bottom $r_n-1$ and $k-n$ genes from the top $m-r_n$, so the probability for that is

$$ P(R_n=r_n)=\frac{\binom{r_n-1}{n-1}\binom{m-r_n}{k-n}}{\binom mk}\;. $$

Then

$$ P(R_n\gt r_n)=\sum_{j=r_n+1}^{m-k+n}\frac{\binom{j-1}{n-1}\binom{m-j}{k-n}}{\binom mk}\;. $$