Combining discrete empirical distributions


I'm running some simulation software and I'm currently in a bit of a pickle.

If I have two sets of distributions, the first giving the probability of 1, ..., n events occurring, and the second giving the probability that each event takes a certain amount of time (e.g. 10, 20, or 30 minutes), how exactly would I combine these to create a distribution of the total amount of time spent?

For example, if we randomly obtain 2 from the first distribution, the total time would be (10/20/30) + (10/20/30) minutes, depending on probability.

How would I generalise this for the entire set of data?


There is 1 solution below


It seems to me you'd do best to model this as a random sum of random variables. Here is an example.

Suppose a search procedure requires $N + 1$ searches, where $N \sim Pois(\lambda=5).$ Also, times for individual searches can be modeled as $X_i \sim Exp(rate = 1/50)$ so that $E(X_i) = \mu = 50$ units of time. Then the total length of time for a search is $T = \sum_{i=1}^{N+1} X_i.$ You may be interested in $P(T > 600).$

Assuming that $N$ and the $X_i$s are all independent, standard probability formulas give $E(T) = (\lambda + 1)\mu = 300,$ which is straightforward, and $$V(T) = \sigma_T^2 = E(N+1)\sigma_X^2 + V(N+1)\mu_X^2,$$ so $\sigma_T = 165.8312,$ which may be surprising because of the second term for the variance.
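To spell out the arithmetic behind that standard deviation: for an exponential distribution the mean and standard deviation coincide, so $\mu_X = \sigma_X = 50,$ and shifting a Poisson count by 1 changes its mean but not its variance, so $E(N+1) = 6$ and $V(N+1) = V(N) = 5.$ Substituting into the formula above gives

$$V(T) = 6(50^2) + 5(50^2) = 15000 + 12500 = 27500, \qquad \sigma_T = \sqrt{27500} \approx 165.83.$$

The second term, which comes from the randomness in the number of searches, contributes almost as much as the first.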

A simulation in R shows that $P(T > 600) \approx 0.0533 \pm 0.0004.$ A normal approximation based on mean 300 and SD 165.83 is inappropriate because the distribution of $T$ is skewed. But my main concern is that trying to combine summarized values of $N$ and $X$ from observed data may not give you a realistic view of the actual variability of $T.$

m = 10^6;  lam = 5;  bet = 1/50;  t = numeric(m)  # iterations, Poisson mean, exponential rate
for (i in 1:m) {
 n = rpois(1, lam)             # N ~ Pois(5)
 t[i] = sum(rexp(n+1, bet))    # total time of the N + 1 searches
}
mean(t);  sd(t);  mean(t > 600)
## 299.7921  # aprx E(T) = 300
## 165.8158  # aprx SD(T) = 165.83
## 0.053278  # aprx P(T > 600)
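For comparison with the simulated tail probability, here is a small sketch of what the normal approximation would claim. The variable names are only illustrative. It uses the theoretical mean 300 and the SD from the variance formula, and it also computes the probability the normal curve assigns to negative times, which are impossible for $T.$

```r
# Normal approximation to T, using the theoretical moments derived above.
mu_T <- 300
sd_T <- sqrt(6 * 50^2 + 5 * 50^2)                    # about 165.83

p_norm <- 1 - pnorm(600, mean = mu_T, sd = sd_T)     # normal "estimate" of P(T > 600)
p_neg  <- pnorm(0, mean = mu_T, sd = sd_T)           # mass the normal puts on T < 0

p_norm;  p_neg
```

Both values come out at roughly 0.035, so the normal curve understates the upper tail relative to the simulated 0.053 and puts a similar amount of probability on negative times.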

Here is a histogram of the simulated distribution of $T$ along with a (preposterously bad) normal density "fit."

[Figure: histogram of the simulated distribution of $T$ with a normal density curve overlaid]
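For the discrete setting in the question, the same random-sum idea can be carried out exactly rather than by simulation. The sketch below is only an illustration: the probabilities in `p_n` and `p_x` are made-up placeholders, not values from the question, and the helper `convolve_pmf` is a hypothetical name. It assumes the number of events is independent of the durations, that durations are independent of each other, and that durations lie on a regular grid (here 10, 20, 30 minutes).

```r
# Hypothetical inputs: replace with your own empirical distributions.
p_n  <- c(0.5, 0.3, 0.2)   # P(N = 1), P(N = 2), P(N = 3)
p_x  <- c(0.2, 0.5, 0.3)   # P(X = 10), P(X = 20), P(X = 30)
step <- 10                 # grid spacing in minutes, so X = step * (1, 2, 3)

# Distribution of the sum of two independent grid-valued variables
# (equivalent to multiplying two polynomials).
convolve_pmf <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    idx <- i:(i + length(b) - 1)
    out[idx] <- out[idx] + a[i] * b
  }
  out
}

n_max <- length(p_n)
total <- numeric(3 * n_max)   # index u corresponds to T = step * u minutes
s_k   <- p_x                  # pmf of the sum of k = 1 durations (1..3 grid units)

for (k in 1:n_max) {
  if (k > 1) s_k <- convolve_pmf(s_k, p_x)      # sum of k durations: k..3k grid units
  units <- k:(3 * k)
  total[units] <- total[units] + p_n[k] * s_k   # weight by P(N = k) and mix
}

dist_T <- data.frame(minutes = step * seq_along(total), prob = total)
print(dist_T)
```

The resulting `dist_T` is a mixture: for each possible count k, the k-fold convolution of the duration distribution, weighted by the probability of observing k events. If the durations do not sit on a regular grid, simulating with `sample()` in the style of the loop above is a simple alternative.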