Combining Weak Classifiers


In a binary classification problem, assume we have five classifiers with accuracies 0.4, 0.5, 0.6, 0.7, and 0.8. The errors of these individual classifiers are independent. What is a good decision function that involves combining the binary (+1, -1) outputs of each component classifier? Why?

Intuitively, my thinking is that something like sign(-0.6w1 + 0.5w2 + 0.6w3 + 0.7w4 + 0.8w5) might work. How would I prove it is better than simply selecting the 0.8-accuracy classifier, though? My issue is this: if classifiers 1, 2, and 3 agree, they can overrule classifiers 4 and 5 agreeing. But the probability of both 4 and 5 being incorrect is 0.3 * 0.2 = 0.06, which is less than 0.4 * 0.5 * 0.4 = 0.08, the probability of the other three being incorrect (treating classifier 1 as sign-flipped by its negative weight, so its effective error is 0.4). If the probability of 4 and 5 being incorrect when they agree is smaller, shouldn't they not be overruled?
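For what it's worth, the two error probabilities quoted above can be checked in a couple of lines (a quick sketch; the variable names are mine, and the first classifier is taken as sign-flipped so its effective error rate is 0.4):

```python
# Chance that classifiers 4 and 5 (accuracies 0.7, 0.8) are both wrong,
# assuming independent errors:
p45_wrong = 0.3 * 0.2             # 0.06

# Chance that the other three are all wrong; flipping the 0.4-accuracy
# classifier's sign makes its effective error 0.4 (and classifier 2's is 0.5):
p123_wrong = 0.4 * 0.5 * 0.4      # 0.08

print(p45_wrong < p123_wrong)     # True
```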

BEST ANSWER

Let $Y$ be the true label and $X_i$ the predictions of the five classifiers. Bayes' rule gives $$ P(Y=y\mid X_i=x_i)= \frac{P(X_i=x_i\mid Y=y)P(Y=y)}{\sum_{y'}P(X_i=x_i\mid Y=y')P(Y=y')}.$$ Taking the ratio of the posteriors for the two labels, $$ \frac{P(Y=1\mid X_i=x_i)}{P(Y=-1\mid X_i=x_i)}=\frac{P(X_i=x_i\mid Y=1)P(Y=1)}{P(X_i=x_i\mid Y=-1)P(Y=-1)}.$$

We should guess $Y=1$ when this is greater than $1,$ so when $$ P(X_i=x_i\mid Y=1) P(Y=1)>P(X_i=x_i\mid Y=-1)P(Y=-1).$$ If we assume flat priors, i.e. $P(Y=1)=P(Y=-1)=1/2,$ then we can cancel those factors. We have $$ P(X_1=1\mid Y=1)=p_1=0.4$$ and $$P(X_1=1\mid Y=-1) = 1-p_1.$$

So if, for example $X_1=1,$ $X_2=-1,$ $X_3=1,\ldots,$ we have $$ P(X_i=x_i\mid Y=1) = p_1(1-p_2)p_3\ldots\\P(X_i=x_i\mid Y=-1)= (1-p_1)p_2(1-p_3)\ldots,$$ so we want to predict $Y=1$ when $$ \frac{p_1}{1-p_1}\frac{1-p_2}{p_2}\frac{p_3}{1-p_3}\ldots > 1,$$ or, taking logs $$ \log\left(\frac{p_1}{1-p_1}\right) - \log\left(\frac{p_2}{1-p_2}\right)+\log\left(\frac{p_3}{1-p_3}\right)+\ldots > 0.$$
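That product of likelihood ratios can be evaluated directly. A minimal sketch, assuming independent errors and flat priors (the function name `posterior_odds` is mine, not from the problem):

```python
# Accuracies p_i of the five classifiers from the question.
ACCS = [0.4, 0.5, 0.6, 0.7, 0.8]

def posterior_odds(votes):
    """P(Y=1 | X=x) / P(Y=-1 | X=x) under flat priors: the product of
    per-classifier likelihood ratios P(x_i | Y=1) / P(x_i | Y=-1),
    where P(X_i = x_i | Y = y) is p_i if x_i == y and 1 - p_i otherwise."""
    odds = 1.0
    for p, x in zip(ACCS, votes):
        num = p if x == 1 else 1 - p    # P(x_i | Y = +1)
        den = p if x == -1 else 1 - p   # P(x_i | Y = -1)
        odds *= num / den
    return odds

# The scenario from the question: classifiers 1-3 vote +1, 4 and 5 vote -1.
print(posterior_odds([1, 1, 1, -1, -1]))  # < 1, so guess Y = -1
```

For that vote pattern the odds come out below 1 (about 0.107), so the optimal rule sides with classifiers 4 and 5, matching the intuition in the question.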

So, catching onto the pattern, you want your decision function to be $\operatorname{sign}(\sum_i w_i x_i)$ where $$ w_i = \log\left(\frac{p_i}{1-p_i}\right).$$ So, for instance, the weight of the best predictor here is $$ \log(0.8/0.2) = \log(4)\approx 1.386$$ and the weight of the worst one is $$ \log(0.4/0.6) \approx -0.405$$ (as you seem to have understood, you should assign a negative weight to a predictor that is wrong more often than not).
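The resulting decision rule can be sketched as follows, together with an exact accuracy computation that enumerates all $2^5$ right/wrong patterns (function names are illustrative; errors are assumed independent and priors flat, as above):

```python
import math
from itertools import product

# Accuracies from the question.
ACCS = [0.4, 0.5, 0.6, 0.7, 0.8]

# Log-odds weights from the derivation: w_i = log(p_i / (1 - p_i)).
WEIGHTS = [math.log(p / (1 - p)) for p in ACCS]

def predict(votes):
    """Weighted-majority decision sign(sum_i w_i x_i)."""
    return 1 if sum(w * x for w, x in zip(WEIGHTS, votes)) > 0 else -1

def exact_accuracy(weights):
    """Exact P(decision correct): enumerate which classifiers are right (+1)
    or wrong (-1); independence lets the probabilities multiply, and by
    symmetry under flat priors we may condition on Y = +1, so classifier i
    outputs x_i = c_i."""
    total = 0.0
    for pattern in product([1, -1], repeat=len(ACCS)):
        prob = 1.0
        for p, c in zip(ACCS, pattern):
            prob *= p if c == 1 else 1 - p
        if sum(w * c for w, c in zip(weights, pattern)) > 0:
            total += prob
    return total

print(round(exact_accuracy(WEIGHTS), 3))  # 0.812
```

With these weights the ensemble's accuracy works out to 0.812, slightly better than the 0.8 of the best single classifier, and classifiers 4 and 5 agreeing do overrule the other three: `predict([1, 1, 1, -1, -1])` returns `-1`.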