Determine probability areas of large data set

Beforehand: I'm not advanced in mathematics, but I'm stuck on a problem I need to solve for my master's thesis.

I broke it down to the following situation:

I have a pool of 105000 possible values, each ranging from 0 to 200. I am picking 866 of these values. Now I want to define an area (an interval) in which the sum of the picked values lies with a certain probability, let's say 75%.

So far I have calculated the average value (94) and created a kind of box plot, so what I know is: 15% of the data is lower than 17, 25% is lower than 27, 50% is lower than 70, 75% is lower than 110 and 85% is lower than 140. It reminds me of a "ball picking without replacement" situation, but that doesn't really help me...

I hope you understand my problem.

There are 2 best solutions below

11
On BEST ANSWER

So you have determined a histogram of the cumulative probability of the data. From that you can roughly estimate the variance, or maybe you can compute it directly from the data.

Once you have the mean and the variance, $868$ is a large enough number of pickings to allow you to use the Central Limit Theorem. Also, $868$ is small enough with respect to the total amount of data that you can consider each picking to be independent of the others
(you may disregard the "sampling without replacement" effect, if that is how you are sampling).
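
To get a feeling for how small the "without replacement" effect is, one can look at the standard finite population correction, which multiplies the variance of the sum by $(N-n)/(N-1)$. A minimal Python sketch, taking $N = 105000$ from the question and $n = 868$ as used in this answer (the function name is only illustrative):

```python
import math

def finite_population_correction(population_size: int, sample_size: int) -> float:
    """Factor that multiplies the variance of a sum when sampling without replacement."""
    return (population_size - sample_size) / (population_size - 1)

N = 105_000  # size of the pool, from the question
n = 868      # number of pickings used in this answer

fpc = finite_population_correction(N, n)
print(f"variance factor: {fpc:.4f}")
print(f"stdv factor:     {math.sqrt(fpc):.4f}")
```

The standard deviation shrinks by well under $1\%$, which is why the pickings can be treated as independent here.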

Looking at your data as a CDF plot, the distribution turns out to be quite near to a Uniform distribution. From that I get a mean of $\approx 78.9$ and a variance of $\approx 2921.7$, which corresponds to a stdv of about $54$.
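
One possible way to make a rough estimate of this kind is to join the quantiles from the question by straight lines, i.e. to assume the data are uniform inside each bracket. The Python sketch below does that. The end points $0$ and $200$ and the piecewise-uniform assumption are mine, so it lands near the figures above but does not reproduce them exactly.

```python
# CDF knots (value, cumulative probability) taken from the question,
# plus the assumed end points (0, 0.0) and (200, 1.0).
knots = [(0, 0.0), (17, 0.15), (27, 0.25), (70, 0.50),
         (110, 0.75), (140, 0.85), (200, 1.0)]

def piecewise_uniform_moments(knots):
    """Mean and variance of a distribution that is uniform between consecutive knots."""
    mean = 0.0
    second_moment = 0.0
    for (a, fa), (b, fb) in zip(knots, knots[1:]):
        weight = fb - fa                                        # probability mass in [a, b]
        mean += weight * (a + b) / 2                            # E[X] of a uniform on [a, b]
        second_moment += weight * (a * a + a * b + b * b) / 3   # E[X^2] of a uniform on [a, b]
    return mean, second_moment - mean ** 2

mean, variance = piecewise_uniform_moments(knots)
print(f"mean ~ {mean:.1f}, variance ~ {variance:.1f}, stdv ~ {variance ** 0.5:.1f}")
```

Under these assumptions the mean comes out at about $76$ and the stdv at about $54$. If you have the raw data, computing the mean and variance directly from it is preferable.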

Now the Central Limit theorem tells us that the sum of the $868$ pickings will be approximately distributed normally, with a
mean $\mu = 868 \, \cdot \, 78.9 = 68,485$
and a variance $\sigma ^2 =868 \, \cdot \, 2921.7=2,536,036$
i.e. with a stdv of about $1592$.
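
To turn this into the $75\%$ range asked for in the question, one option is to take the central interval of that normal distribution. A minimal Python sketch using only the standard library; the choice of a central (symmetric) interval is an assumption, since other $75\%$ intervals exist:

```python
import math
from statistics import NormalDist

n = 868                 # number of pickings used in this answer
mean_single = 78.9      # estimated mean of one picking
var_single = 2921.7     # estimated variance of one picking

# Normal approximation of the sum given by the Central Limit Theorem.
sum_dist = NormalDist(mu=n * mean_single, sigma=math.sqrt(n * var_single))

def central_interval(dist: NormalDist, prob: float) -> tuple[float, float]:
    """Interval symmetric around the mean that contains `prob` of the probability."""
    tail = (1 - prob) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

low, high = central_interval(sum_dist, 0.75)
print(f"75% of the sums fall roughly between {low:.0f} and {high:.0f}")
```

With the figures above this gives an interval of roughly $\mu \pm 1.15\,\sigma$, i.e. from about $66{,}650$ to about $70{,}320$.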

--- reply to your comments ---

a) ${\cal N}\left( \mu, \sigma ^2 \right)$ will approximate the distribution of the sum quite well, except on the tails.
Since the Normal has infinite tails, while your sum is constrained within $[0, 868 \cdot 200]$, whose end points both lie more than $40 \, \sigma$ away from $\mu$, you can just trim the $\cal N$ accordingly, and in practice you do not need to re-normalize it after the trimming.

b) The approximation works both as a CDF and as a PDF. I did not understand whether your data are continuous or discrete.
If discrete, then approximate each value $s$ with the continuous interval $[s \pm 1/2]$.
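
For the discrete case, the $\pm 1/2$ rule can be sketched in Python as follows. The helper name is hypothetical, and $\mu$ and $\sigma$ are the rounded figures computed above:

```python
from statistics import NormalDist

def prob_of_exact_sum(dist: NormalDist, s: int) -> float:
    """Approximate P(S = s) for an integer-valued sum by the normal mass on [s - 1/2, s + 1/2]."""
    return dist.cdf(s + 0.5) - dist.cdf(s - 0.5)

sum_dist = NormalDist(mu=68_485, sigma=1_592)   # rounded figures from this answer

p_at_mean = prob_of_exact_sum(sum_dist, 68_485)
print(f"P(S = 68485) ~ {p_at_mean:.6f}")
```

Because the interval has width $1$ and $\sigma$ is large, the result is practically the same as the normal PDF evaluated at $s$.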

c) How good is the approximation with the normal ?
Well it depends of course on:
- the number of variables summed;
- how "far from normal" are the addends (yours are almost uniform);
- how different from each other are the variances of the addends (yours have the same variance).
There are plenty of articles on bounding the resulting error, and this is not the place to reproduce them.

d) If you also have to deal with the case where the number of addends is much less than $868$, say of the order of $10$, then you may consider an alternative approach, starting from the assumption that the data distribution can be well approximated as uniform.
You can then use the Irwin-Hall distribution, which describes the sum of continuous Uniform variables and which extends well to discrete ones if the base interval is large enough (even about $10$, vs. the $200$ you have), provided that you "center" the discrete values in the continuous range (the $\pm 1/2$ cited above).
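
As an illustration, the Irwin-Hall CDF for $n$ variables uniform on $[0,1]$ is $F(x) = \frac{1}{n!}\sum_{k=0}^{\lfloor x \rfloor} (-1)^k \binom{n}{k} (x-k)^n$, and rescaling by the width $200$ gives the CDF of the sum. A minimal Python sketch under the assumption that the values are continuous and uniform on $[0, 200]$ (the function names are my own):

```python
import math

def irwin_hall_cdf(x: float, n: int) -> float:
    """CDF at x of the sum of n independent Uniform(0, 1) variables."""
    if x <= 0:
        return 0.0
    if x >= n:
        return 1.0
    total = sum((-1) ** k * math.comb(n, k) * (x - k) ** n
                for k in range(math.floor(x) + 1))
    return total / math.factorial(n)

def sum_cdf(s: float, n: int, width: float = 200.0) -> float:
    """CDF at s of the sum of n independent Uniform(0, width) variables."""
    return irwin_hall_cdf(s / width, n)

# Probability that 10 pickings sum to at most 800, and to at most 1000
# (1000 is half of the maximum 10 * 200, so the second value is 0.5 by symmetry).
print(sum_cdf(800, 10), sum_cdf(1000, 10))
```

The alternating sum loses floating-point accuracy as $n$ grows, so this form is only suitable for the small $n$ discussed here; for large $n$ the normal approximation is the better tool.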

e) You do the normal approximation above for three different sets of data, obtaining three different $\cal N$'s.
"If I now sum up my three individual sums, Do I just multiply the possibilities, to get the complete probability for that specific outcome?"
It depends on which process you want to simulate.
If $s_1, s_2 , s_3$ are always obtained by summing that number of pickings from that specific population, then $S= s_1 + s_2 + s_3$ will be surely approximated by $${\cal N}\left( \mu_1 + \mu_2 + \mu_3 \, , {\sigma_1} ^2 + {\sigma_2} ^2 + {\sigma_3} ^2 \right)$$.
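
In code, combining the three approximations amounts to adding the means and adding the variances, assuming the three partial sums are independent. A short Python sketch; the second and third $(\mu, \sigma)$ pairs are made-up placeholders, not values from the question:

```python
import math
from statistics import NormalDist

# (mu, sigma) of each partial sum. The first pair is from this answer,
# the other two are hypothetical placeholders to be replaced by your own.
partial_sums = [
    NormalDist(mu=68_485, sigma=1_592),
    NormalDist(mu=20_000, sigma=900),
    NormalDist(mu=5_000, sigma=400),
]

def combine_independent(dists: list[NormalDist]) -> NormalDist:
    """Normal distribution of the sum of independent normal variables."""
    mu = sum(d.mean for d in dists)
    sigma = math.sqrt(sum(d.variance for d in dists))
    return NormalDist(mu=mu, sigma=sigma)

total = combine_independent(partial_sums)
print(f"S ~ N({total.mean:.0f}, {total.stdev:.0f}^2)")
```

Probabilities for $S$ are then read from `total.cdf(...)`, rather than by multiplying the probabilities of the individual sums.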

4
On

If I understood correctly, you have $N=105000$ real numbers that are in $[0,200]$ and you want to find an interval $[a,b]$ such that $p=0.75=75\%$ of the numbers lie in that interval.

If that's the problem, then it is actually simple: just sort the numbers and look at the $pN=78750$-th one; let's call this number $r_{75}$. Then $75\%$ of your numbers lie in $[0,r_{75}]$.

It's important to notice that there are many other intervals with this property. The one I gave you is the easiest to build, but depending on what you need you may want a symmetric interval around the mean, the shortest possible interval, or something else.
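
A minimal Python sketch of two of these choices, the interval $[0, r_{75}]$ and the shortest interval. The data here are random placeholders standing in for the real $105000$ numbers, and the function names are my own:

```python
import math
import random

def quantile_value(sorted_values, p):
    """Return the ceil(p * N)-th smallest value (1-based); this is r_75 for p = 0.75."""
    k = math.ceil(p * len(sorted_values))
    return sorted_values[k - 1]

def shortest_interval(sorted_values, p):
    """Return the narrowest [a, b] that contains ceil(p * N) of the sorted values."""
    n = len(sorted_values)
    k = math.ceil(p * n)  # number of values the interval must contain
    start = min(range(n - k + 1),
                key=lambda i: sorted_values[i + k - 1] - sorted_values[i])
    return sorted_values[start], sorted_values[start + k - 1]

# Placeholder data: 105000 uniform random numbers in [0, 200], NOT the real data set.
random.seed(0)
values = sorted(random.uniform(0, 200) for _ in range(105_000))

r75 = quantile_value(values, 0.75)
a, b = shortest_interval(values, 0.75)
print(f"[0, {r75:.1f}] and [{a:.1f}, {b:.1f}] each contain 75% of the values")
```

For skewed data like that described in the question, the shortest interval can be noticeably narrower than $[0, r_{75}]$.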