I am reading Peter Flach's introduction to machine learning and came across the following paragraph on p.345 (top of the page), about measuring the performance of a binary classifier.
In machine learning the situation is usually more concrete, and our experimental objective – accuracy, say – is something we can measure in principle, or at least estimate (since we’re generally interested in accuracy on unseen data). However, there may be unknown factors we have to account for. For example, the model may need to operate in different operating contexts with different class distributions. In such a case we can treat accuracy on future data as a random variable and take its expectation, assuming some probability distribution over the proportion of positives pos. Since $acc = pos·tpr + (1 − pos)·tnr$ where (tpr = true positive rate, tnr = true negative rate), and assuming we can measure true positive and negative rates independently of the class distribution, we have (assuming a uniform distribution over pos)
$\mathbb{E}[acc] = \mathbb{E}[pos \cdot tpr + (1 - pos) \cdot tnr] = \mathbb{E}[pos] \cdot tpr + \mathbb{E}[1 - pos] \cdot tnr = tpr/2 + tnr/2$.
I do not understand this statement. What is the relationship between the proportion of positives and the random variable (the accuracy of the model)? Any insights appreciated.
I think this is what the author is saying:
The formula $acc = pos \cdot tpr + (1 - pos) \cdot tnr$ describes how $acc$ varies as a function of $3$ parameters: $pos, tpr, tnr$.
The formula itself is based on the law of total probability, but that's irrelevant here. Here's a useful example:
$acc = P(Test=Disease)$
$pos = P(Disease=1)$
$tpr = P(Test=1\mid Disease=1)$
$1-pos = P(Disease=0)$
$tnr = P(Test=0 \mid Disease=0)$
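For completeness, here is a sketch of how the formula follows from the law of total probability, using the disease/test notation above (the notation is just this example's, not the book's):

$$
\begin{aligned}
acc &= P(Test = Disease) \\
    &= P(Test=1, Disease=1) + P(Test=0, Disease=0) \\
    &= P(Test=1 \mid Disease=1)\,P(Disease=1) + P(Test=0 \mid Disease=0)\,P(Disease=0) \\
    &= tpr \cdot pos + tnr \cdot (1 - pos)
\end{aligned}
$$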
The parameters $tpr = P(Test=1\mid Disease=1)$ and $tnr = P(Test=0 \mid Disease=0)$ are treated as constants, because of "assuming we can measure true positive and negative rates independently of the class distribution".
The parameter $pos = P(Disease=1)$ is unknown, because "the model may need to operate in different operating contexts with different class distributions", so we model $pos$ itself as a random variable, representing "some probability distribution over the proportion of positives pos" (presumably, varying due to "different operating contexts"). Since $acc$ is a function of $pos$, it becomes a random variable too.
Then by linearity we have:
$$\mathbb{E}[acc] = \mathbb{E} [pos·tpr +(1−pos)·tnr] = \mathbb{E}[pos]tpr+\mathbb{E}[1-pos]tnr $$
IMHO everything so far is non-controversial. However, the author then made one more assumption:
By further "assuming a uniform distribution over pos" we have $\mathbb{E}[pos] = 1/2$, and hence $\mathbb{E}[1-pos] = 1/2$ as well, which gives $\mathbb{E}[acc] = tpr/2 + tnr/2$.
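If it helps, here is a minimal simulation sketch of the claim $\mathbb{E}[acc] = tpr/2 + tnr/2$ under $pos \sim Unif(0,1)$. The values of `tpr` and `tnr` are made up for illustration, and the function name is hypothetical; nothing here comes from the book.

```python
import random


def expected_accuracy_uniform(tpr, tnr, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[acc] when pos ~ Unif(0, 1).

    tpr and tnr are held fixed, as the quoted paragraph assumes.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        pos = rng.random()  # one "operating context": a random class proportion
        total += pos * tpr + (1 - pos) * tnr  # accuracy in that context
    return total / n_samples


# Hypothetical classifier rates, chosen only for illustration.
tpr, tnr = 0.9, 0.7

estimate = expected_accuracy_uniform(tpr, tnr)
closed_form = (tpr + tnr) / 2

print(f"simulated E[acc] = {estimate:.4f}")
print(f"tpr/2 + tnr/2    = {closed_form:.4f}")
```

Each loop iteration plays the role of one operating context, so the average over iterations approximates the expectation over $pos$, and the two printed numbers should agree closely.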
IMHO this last assumption might be completely unrealistic (in real life). E.g. if positive means a person having a rare-ish disease, then it's hard to imagine a real-life scenario where $pos \sim Unif(0,1)$ is a good assumption - instead $pos$ will usually be a r.v. highly concentrated near $0$.
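To illustrate how much the choice of distribution matters, here is a rough sketch comparing the uniform assumption with a $pos$ concentrated near $0$. The $Beta(1, 19)$ prior (mean $0.05$) and the `tpr`/`tnr` values are my own assumptions, picked only as an example of a rare-ish positive class.

```python
import random


def expected_accuracy(tpr, tnr, mean_pos):
    """Closed form by linearity: E[acc] = E[pos]*tpr + (1 - E[pos])*tnr."""
    return mean_pos * tpr + (1 - mean_pos) * tnr


def simulate_beta(tpr, tnr, alpha, beta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[acc] when pos ~ Beta(alpha, beta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        pos = rng.betavariate(alpha, beta)
        total += pos * tpr + (1 - pos) * tnr
    return total / n_samples


# Hypothetical classifier rates and prior, for illustration only.
tpr, tnr = 0.9, 0.7
alpha, beta = 1, 19  # assumed prior; E[pos] = alpha / (alpha + beta)

mean_pos = alpha / (alpha + beta)
exact_rare = expected_accuracy(tpr, tnr, mean_pos)
simulated_rare = simulate_beta(tpr, tnr, alpha, beta)
uniform_case = expected_accuracy(tpr, tnr, 0.5)

print(f"rare positives, closed form: {exact_rare:.4f}")
print(f"rare positives, simulated:   {simulated_rare:.4f}")
print(f"uniform pos (tpr/2 + tnr/2): {uniform_case:.4f}")
```

Only $\mathbb{E}[pos]$ enters the result, so any distribution with mean $1/2$ reproduces the book's answer, while a prior concentrated near $0$ pulls $\mathbb{E}[acc]$ towards $tnr$.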
Hope this makes sense?