Voting system with probability


Three independent algorithms are executed in parallel. The role of each algorithm is to give an answer (yes or no), together with a certain confidence (probability), to a number of questions (say 100).

Example:

Question 3: Is this a car?

  • Algo1: Yes (0.7 sure) => P1
  • Algo2: Yes (0.65 sure) => P2
  • Algo3: Yes (0.2 sure, which is equivalent to 0.8 for No) => P3

And P = f(P1, P2, P3) where f() is a function.

I want to proceed with a voting process where the final probability P is determined by the majority, meaning P is high when most of the answers (3 answers per question) are high, and low otherwise. What is the expression of the function f()?

PS:

  • I have tried a simple mean (average), but I don't feel that's enough or reasonable, since the mean is pulled by the max and min values; see the small snippet below.
  • I am not explicitly/necessarily trying to compute an average value. The important thing is that P should represent the majority vote (be more precise and "correct").
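For illustration, a minimal Python snippet (values taken from the example above, on a 0–1 scale) showing how the plain mean is dragged toward "don't know" by the single dissenting vote, while a majority-following statistic such as the median is not:

```python
votes = [0.7, 0.65, 0.2]  # two algorithms lean "yes", one leans "no"

mean = sum(votes) / len(votes)            # ~0.52 -> close to "don't know"
median = sorted(votes)[len(votes) // 2]   # 0.65  -> follows the two-vote majority

print(mean, median)
```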
There are 2 answers below.

Answer 1 (2 votes)

There might be better ways from people who have actually studied such questions, but I'd do this as follows. I'm assuming (rescaling your example) that you get answers in the range from 0 to 100, where 100 means a definite "yes" to the question, 50 means "don't know", and 0 means a definite "no".

Then do the following

1) Order the results $(p_1,p_2,p_3)$ in ascending order: $r_1 \le r_2 \le r_3$.

2) Take the function

$$w(r) = \begin{cases} 0, & \text{if $r < 40$} \\ 0.05r-2, & \text{if $40 \le r \le 60$} \\ 1, & \text{if $r > 60$} \end{cases} $$

3) Calculate the weighted average:

$$f(r_1,r_2,r_3) = \frac{1-w(r_2)}{2}\,r_1 + \frac{1}{2}\,r_2 + \frac{w(r_2)}{2}\,r_3$$


Reasoning: By ordering the results, $r_2$ becomes the deciding vote between leaning towards "yes" or "no". If $r_2 > 50$, you are leaning towards "yes"; if $r_2 < 50$, you are leaning towards "no".

You may have a clear 'consensus of two', which is codified in $w(r)$ as $r_2 > 60$ (clear 'yes' by two algorithms) or $r_2 < 40$ (clear 'no' by two algorithms). Or you may have a grey area, where $r_2$ is near 50.

In the consensus case, the formula I gave comes out as the average of the two consensus opinions: if $r_2 < 40$, then $f(r_1,r_2,r_3)=\frac{r_1+r_2}{2}$. If, OTOH, $r_2 > 60$, then $f(r_1,r_2,r_3)=\frac{r_2+r_3}{2}$.

The problem with applying this formula strictly for $r_2<50$ and $r_2>50$ is that it becomes discontinuous when $r_2$ crosses 50. For example, if $r_1=0, r_2=49, r_3=100$, applying the average of the consensus votes ("no") would result in $\frac{r_1+r_2}{2}=24.5$. If $r_2$ changes slightly to $r_2=51$, the consensus vote changes to "yes", so the average of the consensus votes would be $\frac{r_2+r_3}{2}=75.5$, which is a big change.

In the middle area of $w(r)$ ($40 \le r_2 \le 60$), it gives a weight to both extreme answers ($r_1$ and $r_3$), which takes into account that we don't really have a consensus. As $r_2$ changes from 40 to 60, the weight shifts gradually from $r_1$ to $r_3$, making sure the function $f$ is continuous.
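A minimal Python sketch of this procedure (function and variable names are my own; it assumes the three results are already on the 0–100 scale):

```python
def w(r):
    """Weight given to the larger 'consensus partner', as defined above."""
    if r < 40:
        return 0.0
    if r > 60:
        return 1.0
    return 0.05 * r - 2  # linear ramp from 0 at r=40 to 1 at r=60

def f(p1, p2, p3):
    """Majority-leaning combination of three 0-100 confidence values."""
    r1, r2, r3 = sorted((p1, p2, p3))
    return (1 - w(r2)) / 2 * r1 + 0.5 * r2 + w(r2) / 2 * r3

# Consensus cases: f reduces to the average of the two agreeing votes.
print(f(10, 30, 90))   # r2 = 30 < 40 -> (10 + 30) / 2 = 20.0
print(f(20, 70, 65))   # r2 = 65 > 60 -> (65 + 70) / 2 = 67.5

# Grey area: f now changes smoothly as r2 crosses 50 (compare 24.5 vs 75.5 above).
print(f(0, 49, 100))   # ~47.0
print(f(0, 51, 100))   # ~53.0
```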

Answer 2 (3 votes)

Generally speaking, NNs are better at classification than regression, so I'm assuming the ground truth is something like $[1,0]$ for cars and $[0,1]$ for everything else (easily done with something like keras.utils.to_categorical), and your wanted output is $[P,Q]$, where $P$ is the confidence of "this is a car" and $Q$ is the confidence of "this is not a car". As loss function, cross-entropy sounds right.

A loose generalization of what Ingix said is to have a final fully connected layer with softmax activation: the idea is to have $P = aP_1 + bP_2 + cP_3$ with $a,b,c$ trainable parameters. To really generalize his function, you can try to write a "piecewise linear" layer (which transforms $r$ into $w(r)$) followed by a fully connected layer.
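For instance, a minimal Keras sketch of that trainable combination (layer sizes, optimizer, and variable names are assumptions of mine, not part of the answer):

```python
from tensorflow import keras

# Input: the three confidences [P1, P2, P3]; output: [P, Q] = softmax of a
# trainable linear combination, trained with cross-entropy against the
# one-hot ground truth ([1, 0] = car, [0, 1] = not a car).
inputs = keras.Input(shape=(3,))
outputs = keras.layers.Dense(2, activation="softmax")(inputs)
combiner = keras.Model(inputs, outputs)
combiner.compile(optimizer="adam", loss="categorical_crossentropy")

# y = keras.utils.to_categorical(labels, num_classes=2)
# combiner.fit(p_stack, y, ...)
```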

Another idea, if the three NNs have different structures (so they are extracting different but comparable features), is to use them as networks-in-a-network (like Su et al. did in their first multiview paper or Lu et al. in their multipatch paper): the idea isn't to "agglomerate" the output probabilities of the networks, but to "agglomerate" the features extracted by them and reason on the new "concatenated/pooled" set of features.
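A rough sketch of that feature-level fusion in Keras, assuming each network exposes a feature extractor (the toy backbones below are placeholders, not the architectures from the papers cited):

```python
from tensorflow import keras

def toy_backbone(name):
    """Placeholder feature extractor standing in for one of the three NNs
    with its classification head removed."""
    inp = keras.Input(shape=(32, 32, 3))
    x = keras.layers.Conv2D(8, 3, activation="relu")(inp)
    x = keras.layers.GlobalAveragePooling2D()(x)
    return keras.Model(inp, x, name=name)

backbone_a, backbone_b, backbone_c = (toy_backbone(n) for n in ("a", "b", "c"))

# Agglomerate the extracted features rather than the output probabilities,
# then reason jointly on the concatenated feature set.
image = keras.Input(shape=(32, 32, 3))
features = keras.layers.Concatenate()(
    [backbone_a(image), backbone_b(image), backbone_c(image)]
)
x = keras.layers.Dense(64, activation="relu")(features)
outputs = keras.layers.Dense(2, activation="softmax")(x)
fused = keras.Model(image, outputs)
fused.compile(optimizer="adam", loss="categorical_crossentropy")
```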