Sum of normally distributed random variables constrained to a fixed range


Consider $n$ independent normally distributed random variables $X_{i}$, each with a given mean $m_{i}$ and standard deviation $s_{i}$.

For example, let's say we have 5 of these random variables.

I know the sum of these variables is also normally distributed, with

mean $= m_{1} + m_{2} + m_{3} + m_{4} + m_{5}$, and

standard deviation $= \sqrt{s_{1}^2 + s_{2}^2 + s_{3}^2 + s_{4}^2 + s_{5}^2}$
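As a quick illustration, here is a Python sketch of these two formulas, using the numbers from the concrete example further down in the question:

```python
import math

# Means and standard deviations of the five independent normal variables
# (the numbers from the concrete example below).
m = [80.5, 85.5, 90.5, 95.5, 100.5]
s = [10, 11, 12, 13, 14]

mean_sum = sum(m)                           # m1 + m2 + m3 + m4 + m5
sd_sum = math.sqrt(sum(si**2 for si in s))  # sqrt(s1^2 + ... + s5^2)

print(mean_sum)           # 452.5
print(round(sd_sum, 4))   # 27.0185
```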

Now let's say we only allow the original 5 variables to give values within a specific range, for example between the positive integers $y$ and $z$. That is, we change any value less than $y$ to equal $y$, and any value greater than $z$ to equal $z$.

However, this means the formula for the sum of the variables is now inaccurate, as it still assumes they can take values outside the given range.

I will give a concrete example to help:

Consider 5 normally distributed random variables $x_{1}$, $x_{2}$, $x_{3}$, $x_{4}$ and $x_{5}$.

The means of these variables are $m_{1} = 80.5 $, $m_{2} = 85.5 $, $m_{3} = 90.5 $, $m_{4} = 95.5 $ and $m_{5} = 100.5 $ respectively,

and their standard deviations $s_{1} = 10 $, $s_{2} = 11 $, $s_{3} = 12 $, $s_{4} = 13 $ and $s_{5} = 14 $.

Now let's say the values given by these variables will be constrained between 50 and 150, with any values outside this range being changed to the nearest value within it, as above.

If we want to work out the probability of the sum of the values given by the variables being above 500.5, we cannot just use the usual formula as discussed previously, since it still counts the probability of the variables returning values below 50 or above 150.

What can be done to get a more accurate probability in this case?

1 Answer

Dealing with this analytically is going to be hard, so you may find simulation gets you to a reasonable answer more quickly.

With your numerical example, it will make little difference: only about $0.27\%$ of your sums will be affected at all and most of those not by much. Those that are affected are more likely to be increased by the censoring than reduced, since most of your means are closer to the lower end.
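The $0.27\%$ figure can be checked directly: a simulated sum is affected exactly when at least one of the five draws falls outside $[50, 150]$. A Python sketch, building the normal CDF from `math.erf` so only the standard library is needed:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

m = [80.5, 85.5, 90.5, 95.5, 100.5]
s = [10, 11, 12, 13, 14]
lo, hi = 50, 150

# P(a draw of X_i lands inside [lo, hi]) for each variable
p_inside = [phi((hi - mi) / si) - phi((lo - mi) / si) for mi, si in zip(m, s)]

# P(at least one of the five draws is censored) = 1 - P(all five inside)
p_affected = 1.0 - math.prod(p_inside)
print(f"{p_affected:.4%}")  # roughly 0.27%
```

Most of that probability comes from the lower bound, since the means sit much closer to $50$ than to $150$.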

Without censoring, it is easy enough to find that the probability the sum exceeds $500.5$ is $0.03782$. For example, with R:

m <- c(80.5,85.5,90.5,95.5,100.5)
s <- c(10,11,12,13,14)
msum <- sum(m)
ssum <- sqrt(sum(s^2))
1 - pnorm(500.5,msum,ssum) # theoretical before censoring 
# 0.03782036 
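The same theoretical value can be cross-checked in Python (a sketch using only the standard library, with the normal CDF built from `math.erf`):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

m = [80.5, 85.5, 90.5, 95.5, 100.5]
s = [10, 11, 12, 13, 14]

msum = sum(m)                             # 452.5
ssum = math.sqrt(sum(si**2 for si in s))  # sqrt(730)

p = 1.0 - phi((500.5 - msum) / ssum)      # theoretical, before censoring
print(p)  # about 0.0378204, matching pnorm above
```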

Simulating the censoring for $1$ million sums could give

set.seed(2023)
sims <- matrix(rnorm(5*10^6,m,s),nrow=5)
censoredsims <- ifelse(sims < 50, 50, ifelse( sims > 150, 150, sims))
beforesums <- colSums(sims)
aftersums <- colSums(censoredsims)
table(beforesums > 500.5)
#  FALSE   TRUE 
# 962434  37566
table(aftersums  > 500.5)
#  FALSE   TRUE 
# 962445  37555
table(beforesums > 500.5, aftersums > 500.5)
#         FALSE   TRUE
#  FALSE 962434      0
#  TRUE      11  37555
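For readers without R, here is a Python/NumPy sketch of the same simulation. The seed mirrors the R code but the random streams differ, so the counts will not match exactly:

```python
import numpy as np

rng = np.random.default_rng(2023)  # seed chosen arbitrarily

m = np.array([80.5, 85.5, 90.5, 95.5, 100.5])
s = np.array([10.0, 11.0, 12.0, 13.0, 14.0])

n = 10**6
sims = rng.normal(m, s, size=(n, 5))  # each row: one draw of the 5 variables
censored = np.clip(sims, 50, 150)     # censor to [50, 150]

before = sims.sum(axis=1)
after = censored.sum(axis=1)

p_before = (before > 500.5).mean()
p_after = (after > 500.5).mean()
pulled_below = int(np.sum((before > 500.5) & (after <= 500.5)))
pushed_above = int(np.sum((before <= 500.5) & (after > 500.5)))

print(p_before, p_after)           # both close to 0.0378
print(pulled_below, pushed_above)  # a handful vs. (near) zero
```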

so (allowing for simulation noise) this confirms the theoretical calculation before censoring was correct. It also suggests that the censoring slightly reduces the probability of exceeding $500.5$: roughly $\frac{11}{1000000}$ of the sums were pulled below $500.5$ (by high draws being capped at $150$), while the probability of a sum being pushed above $500.5$ by the lower censoring is even smaller (too small to be seen in this simulation). Personally, I would make this adjustment to the theoretical probability and say the probability after censoring might be about $0.03781$, with some uncertainty in the last decimal place.