Beforehand: I'm not advanced in mathematics, but I'm stuck on a problem I need to solve for my master's thesis.
I broke it down to the following situation:
I have a pool of possible values (105000), ranging from 0 to 200. I am picking 866 of these values. Now I want to define an interval in which the sum of the picked values lies with a certain probability, let's say 75%.
So far I have calculated the average value (94) and created a kind of box plot, so what I know is: 15% of the data is lower than 17, 25% is lower than 27, 50% is lower than 70, 75% is lower than 110 and 85% is lower than 140. It reminds me of a "ball picking without replacement" situation, but that doesn't really help me...
I hope you understand my problem.
So you have determined a histogram of the cumulative distribution of the data. From that you can roughly estimate the variance, or maybe you can compute it directly from the data.
Once you have the mean and the variance, $868$ is a large enough number of picks for the Central Limit Theorem to apply. Also, $868$ is small enough with respect to the total amount of data that you can consider each pick to be independent of the others
(i.e. you may disregard the "sampling without replacement" effect).
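To see how little the "without replacement" effect matters here, one can look at the usual finite-population correction for the variance of a sum, $(N-n)/(N-1)$. This is only a quick sketch; `N` and `n` are just names I chose for the pool size and the number of picks.

```python
import math

N = 105_000   # size of the pool (from the question)
n = 868       # number of picks used in this answer

# Finite-population correction: factor applied to the variance of the sum
# when sampling without replacement instead of with replacement.
fpc = (N - n) / (N - 1)

print(f"variance factor: {fpc:.4f}")
print(f"stdv factor:     {math.sqrt(fpc):.4f}")
```

Both factors come out within about 1% of $1$, which is why treating the picks as independent is harmless here.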
Normalizing your data according to the CDF plot, the distribution turns out to be quite close to uniform. From that I get a mean of $\approx 78.9$ and a variance of $\approx 2921.7$, which corresponds to a stdv of about $54$.
Now the Central Limit theorem tells us that the sum of the $868$ pickings will be approximately distributed normally, with a
mean $\mu = 868 \, \cdot \, 78.9 = 68,485$
and a variance $\sigma ^2 =868 \, \cdot \, 2921.7=2,536,036$
i.e. with a stdv of about $1592$.
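As a sketch of how to turn this into the interval you asked for (75% probability), here is the computation with Python's standard library. It assumes the per-pick mean and variance estimated above and a symmetric interval, i.e. 12.5% cut off from each tail; the variable names are mine.

```python
from statistics import NormalDist

n = 868
mean_one, var_one = 78.9, 2921.7   # per-pick estimates from this answer

# Normal approximation of the sum: mean n*mu, variance n*sigma^2
total = NormalDist(mu=n * mean_one, sigma=(n * var_one) ** 0.5)

# Central 75% interval: leave 12.5% in each tail
lo = total.inv_cdf(0.125)
hi = total.inv_cdf(0.875)
print(f"75% of the sums fall between {lo:.0f} and {hi:.0f}")
```

With these inputs the interval is roughly $\mu \pm 1.15\,\sigma$. If you compute the mean and variance directly from your data, just replace `mean_one` and `var_one`.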
--- reply to your comments ---
a) ${\cal N}\left( \mu, \sigma ^2 \right)$ will approximate the distribution of the sum quite well, except on the tails.
Since the Normal has infinite tails, while your sum is constrained within $[0, 868 \cdot 200]$, which extends at least about $40 \cdot \sigma$ on either side of $\mu$, you can just trim the $\cal N$ accordingly, and in practice you do not need to re-normalize it after the trimming.
b) The approximation works both as a CDF and as a PDF. I did not get whether your data are continuous or discrete.
If discrete, then approximate each value $s$ with the continuous interval $[s \pm 1/2]$.
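For the discrete case, the $\pm 1/2$ rule can be sketched as follows. It assumes integer-valued data and the normal parameters computed above; `prob_sum_equals` is just an illustrative helper name.

```python
from statistics import NormalDist

# Normal approximation of the sum, using the mean and variance from this answer
total = NormalDist(mu=68485.2, sigma=2536035.6 ** 0.5)

def prob_sum_equals(s: int) -> float:
    """Approximate P(sum == s) by the normal mass on [s - 1/2, s + 1/2]."""
    return total.cdf(s + 0.5) - total.cdf(s - 0.5)

p = prob_sum_equals(68485)
print(f"P(sum = 68485) is about {p:.6f}")
```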
c) How good is the approximation with the normal?
Well, it depends of course on:
- the number of variables summed;
- how "far from normal" are the addends (yours are almost uniform);
- how different from each other are the variances of the addends (yours have the same variance).
There are plenty of articles on bounding the resulting error, and there is no point in reproducing them here.
d) If you also have to deal with the case where the number of addends is much less than $868$, say of the order of $10$, then you may consider an alternative approach, starting from the assumption that the data distribution can be well approximated as being uniform.
You can then use the Irwin-Hall distribution, which describes the sum of continuous uniform variables and extends well to discrete ones if the base interval is large enough (even about $10$ vs. the $200$ you have), provided that you "center" the discrete values within the continuous range (the $\pm 1/2$ cited above).
e) You do the normal approximation above for three different sets of data, obtaining three different $\cal N$'s.
"If I now sum up my three individual sums, Do I just multiply the possibilities, to get the complete probability for that specific outcome?"
It depends on which process you want to simulate.
If $s_1, s_2 , s_3$ are always obtained by summing that number of picks from that specific population, independently of each other, then $S= s_1 + s_2 + s_3$ will be well approximated by $${\cal N}\left( \mu_1 + \mu_2 + \mu_3 \, , {\sigma_1} ^2 + {\sigma_2} ^2 + {\sigma_3} ^2 \right).$$
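In code, combining the three approximations could look like the sketch below. Only the first set of parameters comes from this answer; the second and third are made-up numbers for illustration, and the three sums are assumed to be independent.

```python
from statistics import NormalDist

# First sum: the values worked out in this answer
s1 = NormalDist(mu=68485.2, sigma=2536035.6 ** 0.5)
# Second and third sums: hypothetical parameters, purely for illustration
s2 = NormalDist(mu=40000.0, sigma=1200.0)
s3 = NormalDist(mu=25000.0, sigma=900.0)

# Sum of independent normals: the means add and the variances add
S = s1 + s2 + s3
print(f"mean = {S.mean:.1f}, stdv = {S.stdev:.1f}")

# 75% interval for the grand total
print(S.inv_cdf(0.125), S.inv_cdf(0.875))
```

So the probabilities are not multiplied; you add the means and the variances and then read probabilities off the resulting normal.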