Find best specified range within sequence of numbers


What is the best formula to find the most popular 'spread' within a sequence, given a set range size?

E.g., I have 100 numbers and I have to find the most popular 'range' of 10, but the range/band must be contiguous (it cannot be spread out).

So for a smaller example below I have:

3 1 2 1 2 1 3 4 1 1 1 2 1

And the range needs to be 5, so the range in this example would be 2 1 3 4 1, as it totals 11, which no other range exceeds (the overlapping range 1 2 1 3 4 also totals 11).
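
The totals for this small example can be checked with a short R sketch. It is only an illustration, not a formula from the question; the variable names (`s`, `k`, `totals`) are made up here. It sums every run of 5 consecutive values using differences of cumulative sums.

```r
# Sequence from the example above
s <- c(3, 1, 2, 1, 2, 1, 3, 4, 1, 1, 1, 2, 1)
k <- 5  # window length

# Sum of every run of k consecutive values, via differences of cumulative sums
cs <- c(0, cumsum(s))
totals <- cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]

best <- max(totals)              # largest window total
starts <- which(totals == best)  # 1-based starting positions of the best windows
```

Here `totals` is 9 7 9 11 11 10 10 9 6, so the maximum of 11 is reached by the windows starting at positions 4 and 5.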

There are 1 best solutions below


Ideas:

Probably, @ParclyTaxel is right that a brute-force scan of the possible intervals is an efficient, if slightly boring, way to do it. Suppose the observations are integers between 1 and 100. Count how many observations fall in each interval: 1 through 10, 2 through 11, ..., 91 through 100. A program to do this wouldn't be difficult to write.
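
A minimal sketch of such a scan in R might look like the following. The data `obs` are made up for illustration (100 integers drawn uniformly from 1 to 100), and the variable names are hypothetical.

```r
# Hypothetical data: 100 integer observations between 1 and 100 (an assumption)
set.seed(1)
obs <- sample(1:100, 100, replace = TRUE)

width <- 10
lo <- 1:(100 - width + 1)          # left endpoints 1, 2, ..., 91

# Number of observations falling in each interval [a, a + width - 1]
counts <- sapply(lo, function(a) sum(obs >= a & obs <= a + width - 1))

best <- lo[counts == max(counts)]  # left endpoint(s) of the fullest interval(s)
```

Ties are possible, so `best` may hold more than one left endpoint.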

Perhaps a more interesting way is to look at a kernel density estimator of the data, find what it judges to be the 'mode' and start your search there.

If data are roughly $\mathsf{Norm}(\mu=50, \sigma=10),$ then maybe the highest concentration is around 50. I've included a histogram and tick marks for each observation in case they provide additional visual clues. (Program in R statistical software.)

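set.seed(228)
x = rnorm(100, 50, 10)                      # 100 draws with mean 50, SD 10
hist(x, prob=TRUE, main="")                 # histogram on the density scale
lines(density(x), lwd=2, col="darkgreen")   # kernel density estimate
abline(h=0, col="blue2")                    # baseline
rug(x)                                      # tick mark for each observation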

[Figure: histogram of x with the kernel density estimate (green curve) and a rug of tick marks for the observations]

Yes, it seems best to search intervals of width 10 around 50.

If data are from $\mathsf{Gamma}(\text{shape} = 5, \text{rate}=1/3),$ then maybe look in the vicinity of the mode, $\frac{5-1}{1/3}= 12.$
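
For reference, that value comes from setting the derivative of the log-density to zero. Here $\alpha$ denotes the shape and $\lambda$ the rate; these symbols are introduced only for this derivation.

$$f(x) \propto x^{\alpha-1}e^{-\lambda x}, \qquad \frac{d}{dx}\log f(x) = \frac{\alpha-1}{x} - \lambda = 0 \;\Longrightarrow\; x = \frac{\alpha-1}{\lambda} = \frac{5-1}{1/3} = 12 \quad (\alpha \ge 1).$$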

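set.seed(229)
x = rgamma(100, 5, 1/3)                     # 100 draws with shape 5, rate 1/3
hist(x, prob=TRUE, main="")                 # histogram on the density scale
lines(density(x), lwd=2, col="darkgreen")   # kernel density estimate
abline(h=0, col="blue2")                    # baseline
rug(x)                                      # tick mark for each observation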

[Figure: histogram of x with the kernel density estimate (green curve) and a rug of tick marks for the observations]

You can find the mode of the density estimator without looking at a graph:

den.inf = density(x)
den.inf$x[which.max(den.inf$y)]   # x-value where the estimated density peaks
## 12.01986

Oh, wow! Lucky again. Of course, if you don't know the distribution ahead of time, you won't be able to make informed guesses, but you can still use the density-estimation method to know where to search.
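
One way to combine the two ideas, as a rough sketch rather than a definitive method: take the mode of the density estimate, then count observations only in intervals of width 10 that contain it. The grid step of 0.1 and the variable names are arbitrary choices made for illustration.

```r
set.seed(229)
x <- rgamma(100, 5, 1/3)          # same simulated gamma data as above

d <- density(x)
mode.est <- d$x[which.max(d$y)]   # location of the density estimate's peak

width <- 10
# Candidate left endpoints: intervals of this width that contain the mode
# (the 0.1 grid step is an arbitrary assumption)
left <- seq(mode.est - width, mode.est, by = 0.1)

# Number of observations in each candidate interval [a, a + width]
counts <- sapply(left, function(a) sum(x >= a & x <= a + width))

best.left <- left[which.max(counts)]
c(best.left, best.left + width, max(counts))  # interval endpoints and its count
```

Restricting the scan to intervals that contain the mode is a shortcut. If the data have several separated clusters, the fullest interval need not contain the highest peak, so a full scan is the safer check.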

I don't know if your Question is a theoretical exercise, a computer assignment, or a serious practical issue about finding concentrations of observations in big data. If the latter, then maybe looking at density estimators is better than scanning intervals of length 10 and counting.

References: For more about kernel density estimators, see the Wikipedia article, and maybe B. Silverman's book referenced there. Also, this page.