How to calculate mean and variance of harmonic mean

175 Views Asked by At

This question is about an application in computer science, specifically it's for the information retrieval system. I'm trying to evaluate the relevance of search results using mean reciprocal rank (MRR), however to improve this method, need to somehow normalize MRR. To do the normalization I have to calculate the mean and variance of the follows.

MRR formular: $$MRR = \frac{1}{|Q|} \sum^{|Q|}\frac{1}{rank_i}$$

Some examples. Suppose we have 2 documents and 3 queries, then we can have MRR:

2 documents could generate 1/2 if the correct suggestion has been put at the 2nd place, 1 if it's been put at the 1st place. There is only one correct answer/suggestion.

  • (1 + 1 + 1)/3 = 1
  • (1 + 1 + 1/2) / 3 = 5/6
  • (1 + 1/2 + 1/2) / 3 = 2/3
  • ...

If we have 3 documents and 4 queries, then MRR could be:

3 documents could generate 1/3 if the correct suggestion has been put at the 3rd place, 1/2 if it's put at the 2nd place, 1 if it's been put at the 1st place.

  • (1/3 + 1/3 + 1/3 + 1/3) / 4 = 1/3
  • (1 + 1 + 1/2 + 1/2) = 3/4
  • (1 + 1/3 + 1/3 + 1/3) = 1/2
  • ...

Is there any way to compute the variance and mean of MRR conditional on number of documents, i.e,

VAR(MRR|num_document = 10)

E(MRR| num_document = 10)

given we could have infinity queries ($Q \rightarrow +\infty$), without simulating the permutation (kind of computer science way to solve the problem).

Update:

As pointed out in the answer, the underline distribution is not clearly defined. To clarify, here it's assumed to be uniform for the ranks.

1

There are 1 best solutions below

0
On BEST ANSWER

Without specifying a distribution for the rank of the $i^{\rm th}$ query, it is not possible to compute the desired moments. It is not clear what assumptions are being modeled. Suppose there are $Q$ queries and $d$ documents. Then if the system answers a query by randomly guessing a permutation of those $d$ documents, among which there is only one correct answer, the rank of the first correct guess is uniformly distributed on $\{1, 2, \ldots, d\}$. Thus the reciprocal rank is uniform on $\{1, 1/2, \ldots, 1/d\}$, and the $Q$ queries would be the arithmetic mean of $Q$ such IID random variables. We need not compute this distribution explicitly (which, although complicated, is tractable), since by linearity of expectation, we have (where I have denoted $\tilde R = \it{MMR}$) $$\operatorname{E}[\tilde R] = \frac{1}{Q} \sum_{i=1}^Q \operatorname{E}[1/U_i] \overset{\text{iid}}{=} \operatorname{E}[1/U_1] = \sum_{u=1}^d \frac{1}{u} \Pr[U_1 = u] = \frac{H_d}{d},$$ where $U_i \sim \operatorname{DiscreteUniform}(1,d)$ and $H_d = \sum_{u=1}^d 1/u$ is the $d^{\rm th}$ harmonic number. Due to the IID property of the $U_i$, there is also linearity of variance: $$\operatorname{Var}[\tilde R] \overset{\text{ind}}{=} \frac{1}{Q^2} \sum_{i=1}^Q \operatorname{Var}[1/U_i] \overset{\text{iid}}{=} \frac{\operatorname{Var}[1/U_1]}{Q}.$$ I leave the remainder of the computation to you as an exercise.

However, as I have stated at the very beginning, the assumed model here is that the system is randomly guessing any permutation on the $d$ documents. If this is not the case--for example, if the guessing is only limited to $d^* < d$ documents, then $\Pr[\tilde R = 0] > 0$, which is not the case in the above example. It is also possible to have radically different assumptions about the distribution of $\tilde R$ by saying, for instance, that $\tilde R$ is uniform on all of the unique values it can attain for a fixed $Q$ and $d$, since it is a random variable with discrete and finite support. But without more information from you as to what assumptions you are wanting to model, it is not possible to be more specific.