I have two datasets (A & B). They each have 1000 numbers.
99% of the time, A < x <= B.
However, 1% of the time, B < x < A.
How can I solve for x, where x is the value with the highest probability of separating the two groups?
Obviously Max(A) and Min(B) are misleading because there are occasional anomalies (<2%). Can you help me determine the optimum "x" with the highest probability on both sides?
Sample Dataset
A 1
A 1
A 1
A 2
B 2 <--anomaly
A 3
A 3
A 3
A 4
A 5 <--anomaly
B 5 <--division, or `x`
B 5
B 5
B 5
A 6 <--anomaly
B 7
B 8
B 8
B 8
B 9
B 9
B 10
B 10
As in the sample you have given, merge the two datasets and sort them in ascending order. Where A's and B's share the same number, sort all of those A's before those B's.
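Now we have a sequence of 1000 A's and 1000 B's mixed together. Initialise an error counter $e=1000$: with the split placed before the first element, everything is on the B side, so all 1000 A's are mis-classified.
Scan the sequence from left to right, moving the split past one element at a time. If an A is encountered, decrease $e$ by 1 (that A is now on the correct side). If a B is encountered instead, increase $e$ by 1 (that B is now on the wrong side).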
Throughout the scan, also store the split point at which $e$ is smallest. The corresponding number value, which equals the average of the number values immediately before and after the split, is the boundary that minimises the number of mis-classifications.
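The scan described above can be sketched in Python as follows. This is only a minimal illustration, not a definitive implementation: the function name `best_split` and the variables `a_sample` and `b_sample` are made up for the example, and the handling of a split that falls before the first or after the last element (where there is only one neighbour to average) is an assumption of mine.

```python
def best_split(a_values, b_values):
    """Return (boundary, errors) for the split with the fewest mis-classifications."""
    if not a_values and not b_values:
        raise ValueError("need at least one value")

    # Tag A as 0 and B as 1 so that, for equal numbers, A's sort before B's.
    tagged = sorted([(v, 0) for v in a_values] + [(v, 1) for v in b_values])

    # Split before the first element: every A is on the wrong side.
    errors = len(a_values)
    best_errors, best_index = errors, 0

    # Move the split past one element at a time.
    for i, (_, label) in enumerate(tagged, start=1):
        errors += 1 if label == 1 else -1  # B: one more error, A: one fewer
        if errors < best_errors:
            best_errors, best_index = errors, i

    # Boundary is the average of the values on either side of the best split.
    # Assumption: at either end of the sequence, use the single neighbour.
    if best_index == 0:
        boundary = float(tagged[0][0])
    elif best_index == len(tagged):
        boundary = float(tagged[-1][0])
    else:
        boundary = (tagged[best_index - 1][0] + tagged[best_index][0]) / 2
    return boundary, best_errors


# The sample dataset from the question.
a_sample = [1, 1, 1, 2, 3, 3, 3, 4, 5, 6]
b_sample = [2, 5, 5, 5, 5, 7, 8, 8, 8, 9, 9, 10, 10]
boundary, errors = best_split(a_sample, b_sample)
print(boundary, errors)  # 5.0 2
```

On the sample this gives a boundary of 5, the same place the question marks as the division. The counter reports 2 because the tie-break puts the A at 5 on the A side of the split. Under the strict rule A < x that point would count as a third anomaly, as it is marked in the question.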