Distribution of the sum of many lognormal random numbers from same distribution


In my application I have to sum up many (between 1000 and 2000) lognormally distributed random numbers and use their sum. All the numbers I sum follow the same distribution. The current bottleneck is the random number generation, so instead of generating a thousand random numbers and summing them, I would like to draw one number that follows the distribution of the sum of a thousand random numbers from a given lognormal distribution. I have looked at approaches that describe the distribution of a sum of lognormals, but I did not find any that works for a large number of summands.

So I remembered that the sum of two independent random variables $x_1$ and $x_2$ has a probability density function (pdf) that is the convolution of the pdfs of the summands (see this reference, for example).
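As a small sanity check of this fact (my addition, not part of the original post): the pmf of the sum of two fair dice, computed by discrete convolution, matches the pmf obtained by enumerating all 36 outcomes. A minimal Python sketch:

```python
from fractions import Fraction

def convolve(a, b):
    """Discrete convolution of two pmfs given as lists of probabilities."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

die = [Fraction(1, 6)] * 6   # pmf of one die, values 1..6
two = convolve(die, die)     # pmf of the sum, values 2..12

# P(sum = 7) sits at index 7 - 2 = 5 and should be 6/36
print(two[5])                # -> 1/6
```

The same principle carries over to the discretized continuous pdfs below: convolving two normalized discrete pdfs yields the discrete pdf of the sum.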

So I convolve the pdf of the original distribution with itself several times. To keep computation time down, I convolve the result with itself, doubling the number of summands at each step, like this (Matlab code):

% Define lognormal distribution
mu = 0;
sigma = 1;
% create pdf
maxx = 10;
fsteps = 10000;
x = linspace(0.0000001,maxx,fsteps); 
f = 1./(sqrt(2*pi)*sigma*x) .* exp( -((log(x)-mu).^2)./(2*sigma^2)  );
% normalize
f = f / sum(f);
% convolve with itself
f = conv(f,f); % distribution of sum of 2 
f = conv(f,f); % distribution of sum of 4
f = conv(f,f); % distribution of sum of 8
f = conv(f,f); % distribution of sum of 16
f = conv(f,f); % distribution of sum of 32 
f = conv(f,f); % distribution of sum of 64 
f = conv(f,f); % distribution of sum of 128 

The pdf $f(x)$ converges to something close to a Gaussian. This is no big surprise, because that is what I had expected when convolving a positive signal with itself several times. (see also Steven W. Smith: "The Scientist and Engineer's Guide to Digital Signal Processing", p. 135)

But if I look at the distribution of the sum of 128 lognormal random numbers from the same distribution, it does not look like a Gaussian. I compute it like this (including the comparison plot against the convolution product):

N = 1000000; % use a buffer of one million random numbers
r = zeros(N,1);
% sum 128 buffers of lognormally distributed random numbers together
for i = 1:128 
    r = r + exp(mu + randn(N,1) * (sigma));
end
% Plot histogram and convolution product
% Calculate resolution of original pdf
xRes = maxx / fsteps;
% Apply resolution to indices of convolution product
xf = (1:numel(f)) * xRes;
% ... and use x-coordinates of conv. product as bin centers
h = hist(r,xf); 
hold off;
plot(xf,h / max(h));
hold all;
plot(xf,f / max(f)); 
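As a sanity check on the Monte-Carlo side (a Python sketch of the same loop rather than Matlab, with a smaller sample size; the sample size and seed are my choices): the sample mean of the sum of 128 such lognormals should be close to the theoretical mean $128 \cdot e^{\mu + \sigma^2/2} \approx 211$.

```python
import math
import random

random.seed(42)

mu, sigma = 0.0, 1.0
n_terms = 128   # summands per draw, as in the Matlab loop
n_draws = 5000  # Monte-Carlo sample size (kept small for speed)

sums = []
for _ in range(n_draws):
    s = 0.0
    for _ in range(n_terms):
        s += math.exp(mu + sigma * random.gauss(0.0, 1.0))
    sums.append(s)

sample_mean = sum(sums) / n_draws
theory_mean = n_terms * math.exp(mu + sigma**2 / 2)  # 128*e^0.5, ~211
print(sample_mean, theory_mean)
```

The Monte-Carlo mean agrees with theory, so the random-number side behaves as expected; the question is why the convolution product does not.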

But the histogram (blue) looks skewed and its peak is not at the same location as the peak of the pdf (green) calculated by convolution (see image below).
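One caveat worth noting (my addition, using the standard lognormal moment formulas): even with 128 summands the central limit theorem has not fully kicked in. The skewness of a sum of $n$ iid variables falls off only as $1/\sqrt{n}$, so some residual skew in the histogram is expected:

```python
import math

sigma = 1.0
n = 128

# Skewness of a single Lognormal(mu, sigma) (independent of mu):
# (exp(sigma^2) + 2) * sqrt(exp(sigma^2) - 1)
skew_single = (math.exp(sigma**2) + 2) * math.sqrt(math.exp(sigma**2) - 1)

# Skewness of a sum of n iid copies scales as 1/sqrt(n)
skew_sum = skew_single / math.sqrt(n)

print(skew_single, skew_sum)  # ~6.18 and ~0.55: the sum is still visibly skewed
```

This accounts for the skew, but not for the peak sitting in the wrong place.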

[Image: Histogram and PDF of sum of lognormal random numbers]

I tried a higher resolution for the pdfs I convolve, because I suspected that the resolution might be too low, especially for the left part of the curve, but that did not change anything. I also tried random numbers from random.org (generated from atmospheric noise) to rule out a problem in the random number generation, using the Box-Muller transform to turn the uniformly distributed numbers into normally distributed ones, but the results were still the same. Finally, I tried to increase the accuracy of the summation (using a multi-precision library for Matlab), but that did not change anything either.

My question is: Why are the two not the same?

1 Answer
If it's any consolation, I struggled with this question for a few hours. The solution is strikingly simple.

The source of the mismatch lies in the initial declaration of x, specifically its upper bound. If you extend the upper bound from 10 to 100 (possibly even less) in this example, you will see much better agreement between the convolution result (green) and the randomly generated sums (blue).

Conceptually, it makes sense as well. By limiting x to 10, you are excluding potentially very large values which, although they occur very rarely, can have a significant impact when they do.
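This can be quantified (my addition, using the OP's mu = 0, sigma = 1 and cutoff maxx = 10, with the standard lognormal CDF): the truncated tail still carries noticeable mass, and renormalizing the truncated pdf shifts its mean, which over 128 summands adds up to a visible shift of the peak:

```python
import math

mu, sigma, cutoff, n = 0.0, 1.0, 10.0, 128

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# P(X > cutoff) for X ~ Lognormal(mu, sigma)
z = (math.log(cutoff) - mu) / sigma
p_tail = 1 - Phi(z)                # ~0.011: roughly 1 in 90 draws exceeds 10
expected_exceedances = n * p_tail  # ~1.4 excluded values per sum of 128

# Mean of the full vs. the truncated-and-renormalized distribution:
# E[X | X <= a] = exp(mu + sigma^2/2) * Phi((ln a - mu - sigma^2)/sigma) / Phi((ln a - mu)/sigma)
full_mean = math.exp(mu + sigma**2 / 2)           # ~1.649
trunc_mean = full_mean * Phi(z - sigma) / Phi(z)  # ~1.506
peak_shift = n * (full_mean - trunc_mean)         # ~18: the sum's mean moves left

print(p_tail, expected_exceedances, peak_shift)
```

So with maxx = 10, each summand's mean is understated by about 0.14, pushing the convolution-based peak roughly 18 units to the left of the Monte-Carlo histogram, which matches the observed mismatch.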

Due to my computer's limitations, I replicated a much smaller version of your code, but I am fairly confident that this was the source of the error.