Finding elements far from mean


I'm designing a program to segment a scanned document into individual characters (for character recognition). When I have a list of segmented characters, I know their dimensions and by extension their areas. I want to remove elements from this list which are well outside what is expected. For example, I might have these elements:

$$e(area) = 0(164), 1(172), 2(188), 3(160), 4(12), 5(158), 6(160), 7(4)$$

From this list, elements $4$ and $7$ have areas so small compared to the rest that they can be removed. Elements like these might be noise accidentally segmented into a character, or something like a pen slipped when writing.

What methodology would be best to eliminate elements from a list which are so far from the mean?

I cannot use "remove the lowest x% of elements", as this can remove valid elements (e.g. when there are no very small elements), nor can I use "remove elements with area smaller than n", as the area of characters can vary wildly.

Can standard deviation be used in this manner? Does that work how I think it works? "remove elements outside 3 standard deviations from the mean"? In my testing, these supersmall elements comprise roughly 0~1.2% of any given document.

Best answer

I would suggest using the boxplot method of identifying outliers. Begin by finding the 'five-number summary' of your data: these are (in order of increasing size) the minimum, lower-quartile $Q_1$, median, upper-quartile $Q_3,$ and maximum. The quartiles and the median cut the sorted sample into four 'chunks' of approximately equal numbers of values. The 'interquartile range' is $\text{IQR} = Q_3 - Q_1.$

Boxplot outlier rules. An outlier rule designates as 'outliers' values below $Q_1 - k\,\text{IQR}$ and above $Q_3 + k\,\text{IQR},$ for some appropriately chosen $k.$ These boundaries, called 'fences', are used to find outliers but are not explicitly drawn in a boxplot.

Commonly chosen values are $k = 1.5$ (for so-called 'possible outliers') and $k = 3$ (for 'probable outliers'). Also, $k = 2.25$ has been suggested because, for normal data, it tends to tag as outliers approximately the same percentage of values across small and moderate numbers of observations $n.$

Data for an example. I may be on dangerous ground here, but I am guessing that the sample size of your example ($n = 8$) is much smaller than in a realistic example and that the proportion of 'contamination' with bad values (25%) is larger than in a realistic example. So I will use simulated data that are 90% from $\mathsf{Norm}(\mu = 170,\, \sigma = 12)$ with 10% contamination from $\mathsf{Norm}(\mu = 15,\, \sigma = 2).$ Here are $n = 100$ such values. I have rounded them to integers, sorted them, and found the quartiles and the interquartile range IQR:

x.ok = rnorm(90, 170, 12);  x.out = rnorm(10, 15, 2)
x = sort(round(c(x.ok, x.out)))   # round to integers, then sort
x
##  [1]  12  13  14  15  15  15  16  17  17  17 141 145 146 146 148 149 150 150 152 153
## [21] 155 156 157 160 160 160 162 162 163 164 164 164 164 164 165 165 166 166 167 167
## [41] 168 168 168 168 168 169 170 170 170 171 171 171 171 172 172 172 173 173 173 173
## [61] 174 174 174 174 175 176 176 177 177 177 177 178 178 178 178 178 178 178 179 179
## [81] 179 180 181 182 182 184 184 184 185 185 186 187 189 190 190 191 191 193 196 203
quantile(x);  IQR(x)
##  0%  25%  50%  75% 100%  # five number summary 
##  12  160  171  178  203 
##  18                      # IQR

par(mfrow=c(1,3))
boxplot(x, col="red", main="Outlier Rule: 1.5 IQR")  # k=1.5 is default
boxplot(x, range=2.25, col="blue2", main="Outlier Rule: 2.25 IQR")
boxplot(x, col="green2", range=3, main="Outlier Rule: 3 IQR")
par(mfrow=c(1,1))

[Figure: boxplots of x under the 1.5 IQR, 2.25 IQR, and 3 IQR outlier rules]

For my fake data the contamination is sufficiently rare (10%) and extreme (good values averaging about 7 standard deviations above bad ones) that boxplots with all three criteria (choices of $k$) correctly label the ten 'contamination' values as outliers.
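The fences can also be computed directly, which is what you would do to drop the flagged values inside a program. The following is a minimal sketch, assuming the 100 sorted values printed above (typed in by hand); `iqr_outlier` is a hypothetical helper name, and it takes the quartiles from `quantile()`, which can differ slightly from the hinges `boxplot()` uses in small samples.

```r
# The 100 sorted values from the printout above.
x <- c( 12,  13,  14,  15,  15,  15,  16,  17,  17,  17,
       141, 145, 146, 146, 148, 149, 150, 150, 152, 153,
       155, 156, 157, 160, 160, 160, 162, 162, 163, 164,
       164, 164, 164, 164, 165, 165, 166, 166, 167, 167,
       168, 168, 168, 168, 168, 169, 170, 170, 170, 171,
       171, 171, 171, 172, 172, 172, 173, 173, 173, 173,
       174, 174, 174, 174, 175, 176, 176, 177, 177, 177,
       177, 178, 178, 178, 178, 178, 178, 178, 179, 179,
       179, 180, 181, 182, 182, 184, 184, 184, 185, 185,
       186, 187, 189, 190, 190, 191, 191, 193, 196, 203)

# Hypothetical helper: TRUE for values outside the fences
# Q1 - k*IQR and Q3 + k*IQR.
iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- q + c(-1, 1) * k * (q[2] - q[1])
  x < fence[1] | x > fence[2]
}

# Number of values flagged under each of the three rules.
for (k in c(1.5, 2.25, 3)) {
  cat("k =", k, ":", sum(iqr_outlier(x, k)), "values flagged\n")
}

# Keep only the values inside the k = 3 fences.
x.clean <- x[!iqr_outlier(x, k = 3)]
```

With $Q_1 = 160$ and $Q_3 = 178,$ the lower fences are 133, 119.5, and 106, so each rule flags the same ten small values and leaves the other 90 in `x.clean`.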

Notice that the 'boxes' in boxplots extend from $Q_1$ to $Q_3$ (height IQR). The 'whiskers' extend upward and downward to the most extreme non-outlier values. Outliers are plotted individually.

You need to know that boxplot criteria also flag some extreme values as outliers in pure normal data (no contamination). In samples of size $n = 100,$ the average number flagged per sample is almost 1 for $k = 1.5,$ about 0.05 for $k = 2.25,$ and about 0.003 for $k = 3.$ So use $k = 3$ when possible.
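You can estimate these false-alarm rates yourself by simulation. This is a rough sketch, assuming standard normal samples of size 100 and using `boxplot.stats()`, whose `coef` argument plays the role of $k$; the seed and the number of replications are arbitrary choices, and `avg_flagged` is a hypothetical helper name.

```r
set.seed(1)   # arbitrary seed, only so that a run can be repeated
n <- 100      # sample size
reps <- 10000 # number of simulated samples per rule

# Average number of values that the boxplot rule with multiplier k
# flags in a pure normal sample of size n.
avg_flagged <- function(k) {
  mean(replicate(reps, length(boxplot.stats(rnorm(n), coef = k)$out)))
}

avg <- sapply(c(1.5, 2.25, 3), avg_flagged)
names(avg) <- c("k=1.5", "k=2.25", "k=3")
avg
```

Because the rule is unaffected by shifting or rescaling the data, the same rates apply to any normal population, not just the standard normal one.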


Breakdown. By contrast, if the 'good' observations are from $\mathsf{Norm}(150,15)$ and 20% of the sample is contamination from $\mathsf{Norm}(35,5),$ then the criterion with $k=3$ begins to 'break down', possibly not identifying all of the 'bad' observations.
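One way to see why is to look at where the lower fence sits for the population rather than for one sample. The sketch below assumes an 80%/20% mixture of the two normal distributions just described; it finds the mixture's quartiles numerically and then the share of the 'bad' distribution lying below each lower fence. The names `mix_cdf` and `mix_q` are hypothetical helpers, and a real sample of finite size will scatter around these population values.

```r
# CDF of the mixture: 80% Norm(150, 15) plus 20% Norm(35, 5).
mix_cdf <- function(x) 0.8 * pnorm(x, 150, 15) + 0.2 * pnorm(x, 35, 5)

# Population quantile of the mixture, found by numerical root finding.
mix_q <- function(p) {
  uniroot(function(x) mix_cdf(x) - p, interval = c(0, 300), tol = 1e-9)$root
}

q1 <- mix_q(0.25)
q3 <- mix_q(0.75)
k <- c(1.5, 2.25, 3)
lower_fence <- q1 - k * (q3 - q1)

# Share of the 'bad' distribution that lies below each lower fence.
caught <- pnorm(lower_fence, 35, 5)
round(rbind(k, lower_fence, caught), 3)
```

Under these assumptions the $k = 3$ lower fence lands near the centre of the bad distribution, so only roughly half of the bad values would be expected to fall below it, while the $k = 1.5$ fence still lies well above them.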


For your data, you might want to explore which value of $k$ seems to do the best job of identifying 'contamination' values. However, for the small sample of eight values, boxplots do not detect the values 4 and 12 as outliers; the contamination is so prevalent that it stretches the IQR to the point where the criterion is useless.
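You can check the eight-value case directly. This short sketch uses the areas from the question together with `fivenum()` and `boxplot.stats()`, the summaries on which R's `boxplot()` is based.

```r
areas <- c(164, 172, 188, 160, 12, 158, 160, 4)  # areas from the question

fivenum(areas)            # min, lower hinge, median, upper hinge, max
boxplot.stats(areas)$out  # values beyond the default 1.5 IQR fences
```

The hinges are 85 and 168, so even with $k = 1.5$ the lower fence is $85 - 1.5(83) = -39.5,$ which is below both small areas, and nothing is flagged.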


Note: If you do not find that boxplot outlier rules 'work' to find outliers in your kind of data, here is another method that is sometimes used: (a) Remove one value at a time from the sample; (b) find the average $a$ and standard deviation $s$ of the remaining $n-1$ values; (c) discard the 'removed' value if it is not within the interval $a \pm 2s$; (d) iterate for all $n$ observations. (If you use mean and SD of the whole sample, they are distorted by bad observations, and results won't reject enough bad observations.)
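Steps (a) to (d) can be sketched as follows. This is only an illustration under the stated $a \pm 2s$ rule, applied to the eight areas from the question; `loo_outlier` is a hypothetical helper name.

```r
# Hypothetical helper: flag x[i] when it lies outside a +/- k*s, where
# a and s are the mean and SD of the other n - 1 values.
loo_outlier <- function(x, k = 2) {
  sapply(seq_along(x), function(i) {
    rest <- x[-i]                        # (a) leave value i out
    a <- mean(rest)                      # (b) mean of the rest
    s <- sd(rest)                        #     and their SD
    abs(x[i] - a) > k * s                # (c) outside a +/- k*s?
  })                                     # (d) repeated for every i
}

areas <- c(164, 172, 188, 160, 12, 158, 160, 4)  # areas from the question
flagged <- loo_outlier(areas)
areas[flagged]    # values to discard
areas[!flagged]   # values to keep
```

For the eight areas in the question this flags 12 and 4 and keeps the other six, although the margin for 12 is small because the value 4 is still inflating the standard deviation when 12 is tested.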

Also, you may want to read some of the literature on methods of 'exploratory data analysis' (EDA) and 'robust estimation' to see if you can find more suitable methods.

However, if contamination values are too close to legitimate ones or if the percentage of contamination is high, it will be very difficult to find any method that can identify all or almost all contamination values.