How to know when samples tend toward lowest number?


I am working on a computer simulation. Ideally, I only want to know the minimum result value from the simulation. (I have no a priori knowledge about the range of numbers or their distribution.) I can take as many samples as I like, but it would be handy to know when I have a statistically "good" answer. I will never know the perfect answer, but I am thinking there is probably a good way to know that my samples are no longer returning numbers smaller than those previously seen (the samples are unordered, of course).

The information I have found online talks about knowing that you are near a mean, having a good spread of samples, and so on. What I really want is some confidence that I have already simulated my lowest possible result: continuing to run MAY still find a lower result, but that is not significantly likely, and a new minimum probably wouldn't be significantly lower than the lowest I have already found.

I am computing some statistics now, including the mean, standard deviation, variance, skewness, and kurtosis, but as I am not a stats person by trade, I don't want to interpret these numbers wrongly and make a bad assumption. Any thoughts or ideas here would be appreciated!

Best Answer

In statistical terms, you're drawing $n$ samples from a set $X$, i.e. you have $n$ independent random variables $X_1,\ldots,X_n$, all with distribution $D$ over $X$, and want to estimate $\min X$ from that.

To determine the quality of your estimate, you'll need to assume something about that distribution $D$. First and foremost, $D$ needs to have a minimal element, which in particular means that you cannot assume $D$ to be normally distributed.

Uniform Distributions

Say, for example, that $D$ is the uniform distribution over $X = [a,b]$. Then the probability that $\min\{X_1,\ldots,X_n\} > a+\epsilon$, i.e. that your estimate is off by more than $\epsilon$, is the probability that $X_1,\ldots,X_n > a + \epsilon$. For a single $X_i$, $\mathbb{P}(X_i > a + \epsilon) = 1 - \frac{\epsilon}{b-a}$, and since the $X_i$ are independent, you get that $$ \mathbb{P}\left(\min\{X_1,\ldots,X_n\} > a + \epsilon\right) = \left(1 - \frac{\epsilon}{b-a}\right)^n \text{.} $$ Thus, to ensure that the probability of the error exceeding $\epsilon$ is smaller than $\alpha$, you need to have $$ \left(1 - \frac{\epsilon}{b-a}\right)^n < \alpha $$ i.e. take at least $$ n = \frac{\log \alpha}{\log \left(1 - \frac{\epsilon}{b-a}\right)} $$ samples.
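As a sanity check, the bound above can be computed and verified empirically. This is a minimal sketch; the function name and the values of $a$, $b$, $\epsilon$, and $\alpha$ below are illustrative choices, not taken from the question:

```python
import math
import random

def samples_needed_uniform(a, b, eps, alpha):
    """Smallest n with (1 - eps/(b-a))**n < alpha."""
    return math.ceil(math.log(alpha) / math.log(1 - eps / (b - a)))

# Uniform over [0, 10], estimate the minimum to within eps = 0.1,
# with error probability below alpha = 0.05.
n = samples_needed_uniform(a=0.0, b=10.0, eps=0.1, alpha=0.05)

# Empirical check: the sample minimum should land within eps of a
# in roughly a (1 - alpha) fraction of trials.
random.seed(1)
trials = 2000
hits = sum(
    min(random.uniform(0.0, 10.0) for _ in range(n)) <= 0.0 + 0.1
    for _ in range(trials)
)
print(n, hits / trials)
```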

Arbitrary Distributions with known $\mathbb{P}(X_i > \min X + \epsilon)$

The same idea works for arbitrary distributions $D$ as long as you know (or can guess) the probability $p$ that a single sample deviates from the minimum by more than $\epsilon$, i.e. if you know $$ p = \mathbb{P}(X_i > \min X + \epsilon) \text{.} $$ Again, due to the independence of the samples, the probability that all $n$ samples miss the $\epsilon$-neighborhood of the minimum is $$ \mathbb{P}\left(\min\{X_1,\ldots,X_n\} > \min X + \epsilon\right) = p^n $$ meaning, as before, that you need to take at least $$ n = \frac{\log \alpha}{\log p} $$ samples to make the probability that the estimation error exceeds $\epsilon$ smaller than $\alpha$.
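The general bound is a one-liner; the values of $p$ and $\alpha$ below are illustrative, not from the question:

```python
import math

def samples_needed(p, alpha):
    """Smallest n with p**n < alpha, i.e. n >= log(alpha)/log(p)."""
    return math.ceil(math.log(alpha) / math.log(p))

# With p = 0.99 this reproduces the uniform case with eps/(b-a) = 0.01.
print(samples_needed(p=0.99, alpha=0.05))  # → 299
```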

Arbitrary Distributions

If you don't know a suitable $p$, you can attempt to estimate it from your samples. If $M_n$ is the minimum of the first $n$ samples, set $$ \hat p_n = \frac{1}{n}\left|\left\{i \,\big|\, 1 \leq i \leq n,\ X_i > M_n + \epsilon\right\}\right| \text{ where } M_n = \min \{X_1,\ldots,X_n\} \text{,} $$ i.e. $\hat p_n$ is the fraction of $X_1,\ldots,X_n$ that deviate from the observed minimum by more than $\epsilon$. You can then compute $$ \hat q_n = \frac{\log \alpha}{n\log \hat p_n} \text{.} $$ By the previous section, $\hat q_n \leq 1$ means that the probability of $M_n$ exceeding the true minimum by more than $\epsilon$ is less than $\alpha$, assuming that $\hat p_n$ is the true probability that $X_i > \min X + \epsilon$. Note that $\hat q_n$ is only really defined once $\hat p_n > 0$, i.e. once you reach an $n$ for which some $X_i > M_n + \epsilon$. But that's a good safeguard against stopping too early anyway: produce samples until $\hat p_n > 0$, then stop once $\hat q_n \leq 1$. This assumes, however, that you pick a reasonably small $\epsilon$ - if you pick it so large that $\mathbb{P}(X_i > \min X + \epsilon) = 0$, you will almost certainly produce samples forever.
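This stopping rule can be sketched as follows; the function name, $\epsilon$, $\alpha$, and the exponential source (standing in for the unknown simulation) are all illustrative choices of mine:

```python
import math
import random

def sample_until_confident(draw, eps, alpha, max_samples=100_000):
    """Draw samples until q_hat <= 1; return (estimated minimum, samples used)."""
    xs = []
    while len(xs) < max_samples:
        xs.append(draw())
        n = len(xs)
        m = min(xs)
        # p_hat: fraction of samples deviating by more than eps from the minimum
        p_hat = sum(x > m + eps for x in xs) / n
        if p_hat > 0:  # q_hat is only defined once some sample exceeds m + eps
            q_hat = math.log(alpha) / (n * math.log(p_hat))
            if q_hat <= 1:
                return m, n
    return min(xs), len(xs)

# Exponential(1) samples: true minimum is 0, so m should come out small.
random.seed(0)
m, n = sample_until_confident(lambda: random.expovariate(1.0), eps=0.05, alpha=0.05)
print(m, n)
```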

You could probably improve this further by modifying $\hat p_n$ to account for the uncertainty of the estimate for small $n$. Since a larger $\hat p_n$ makes it less likely to stop at a given $n$, such a conservative correction would need to increase $\hat p_n$ for small $n$, with the correction shrinking as $n$ gets larger.

Be warned that I haven't checked if this is really statistically sound for any distribution $D$.