Non Normal Probability Distribution

Sorry if this has already been answered; I lack the mathematical vocabulary to describe what I want to achieve in a search engine.

I am trying to model a probability curve of the number of days from invoice to payment.

So for example, I have 343 invoices and I've created a list of how many days it took to receive payment on each one:

[5,    5,    7,    7,   12,   16,   16,   19,   20,   20,   20,
25,   25,   27,   27,   27,   28,   30,   30,   31,   31,   31,
31,   32,   35,   36,   36,   39,   39,   42,   42,   42,   42,
45,   47,   48,   48,   48,   48,   48,   49,   49,   51,   51,
51,   52,   52,   53,   53,   53,   56,   57,   58,   58,   58,
58,   59,   61,   62,   62,   63,   64,   64,   66,   66,   70,
70,   72,   72,   73,   73,   74,   75,   77,   77,   78,   82,
83,   84,   85,   85,   86,   88,   89,   90,   91,   91,   92,
92,   92,   92,   94,   95,   95,   96,   97,   97,  100,  102,
103,  103,  106,  108,  108,  109,  111,  112,  113,  116,  117,
119,  119,  122,  122,  122,  122,  122,  124,  125,  129,  130,
130,  132,  132,  133,  133,  134,  135,  137,  138,  139,  139,
140,  141,  143,  143,  144,  145,  145,  148,  152,  154,  156,
156,  156,  158,  159,  161,  164,  169,  169,  170,  170,  172,
172,  173,  174,  174,  175,  179,  179,  181,  182,  183,  187,
188,  188,  189,  189,  189,  194,  194,  195,  195,  196,  197,
200,  203,  203,  205,  206,  208,  209,  210,  211,  211,  214,
216,  218,  222,  222,  224,  227,  234,  234,  234,  236,  236,
240,  240,  241,  245,  245,  249,  249,  249,  251,  252,  252,
257,  257,  258,  258,  262,  269,  273,  273,  279,  282,  287,
291,  293,  294,  295,  296,  297,  297,  300,  302,  303,  303,
307,  308,  318,  320,  325,  330,  358,  358,  359,  363,  376,
380,  391,  394,  397,  401,  405,  409,  411,  413,  418,  418,
419,  431,  434,  434,  443,  445,  448,  448,  455,  461,  463,
468,  471,  476,  482,  482,  483,  484,  485,  494,  494,  497,
499,  503,  512,  513,  517,  520,  524,  526,  535,  536,  538,
545,  552,  553,  557,  561,  563,  563,  576,  585,  602,  608,
614,  614,  616,  619,  627,  633,  635,  637,  649,  674,  676,
679,  684,  693,  704,  712,  736,  745,  756,  777,  794,  810,
815,  824,  827,  836,  838,  838,  841,  865,  890,  893,  900,
916,  936,  945,  966,  984,  991, 1020, 1027, 1052, 1078, 1184,
1275, 1545]

What I want to know is: given an invoice, what is the probability it falls between a range of values? For example, say I'm given a random invoice. What is the probability the invoice took between 200 and 400 days to receive payment?

I used numpy to find the mean and standard deviation of the list above, and using that I get a normal distribution that looks like this:

[Figure: normal distribution]

Now immediately you can see the problem: a normal distribution apparently won't work, because you can't have a negative number of payment days, yet the curve puts part of its area below zero. As a result, the area under the curve from 0 to infinity is less than 100%, which means I don't trust the values I get (according to this curve, the probability that a random invoice takes between 200 and 400 days to be paid is 29%, which seems too high).
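The check described above can be sketched with Python's standard library alone. This is only an illustration: the question used numpy, and `statistics.NormalDist` is swapped in here to avoid the dependency. The variable names are made up for the example, and the range 200 to 400 is assumed to include both endpoints.

```python
from statistics import NormalDist, fmean, pstdev

# Days from invoice to payment, copied from the list in the question.
days = [
    5, 5, 7, 7, 12, 16, 16, 19, 20, 20, 20,
    25, 25, 27, 27, 27, 28, 30, 30, 31, 31, 31,
    31, 32, 35, 36, 36, 39, 39, 42, 42, 42, 42,
    45, 47, 48, 48, 48, 48, 48, 49, 49, 51, 51,
    51, 52, 52, 53, 53, 53, 56, 57, 58, 58, 58,
    58, 59, 61, 62, 62, 63, 64, 64, 66, 66, 70,
    70, 72, 72, 73, 73, 74, 75, 77, 77, 78, 82,
    83, 84, 85, 85, 86, 88, 89, 90, 91, 91, 92,
    92, 92, 92, 94, 95, 95, 96, 97, 97, 100, 102,
    103, 103, 106, 108, 108, 109, 111, 112, 113, 116, 117,
    119, 119, 122, 122, 122, 122, 122, 124, 125, 129, 130,
    130, 132, 132, 133, 133, 134, 135, 137, 138, 139, 139,
    140, 141, 143, 143, 144, 145, 145, 148, 152, 154, 156,
    156, 156, 158, 159, 161, 164, 169, 169, 170, 170, 172,
    172, 173, 174, 174, 175, 179, 179, 181, 182, 183, 187,
    188, 188, 189, 189, 189, 194, 194, 195, 195, 196, 197,
    200, 203, 203, 205, 206, 208, 209, 210, 211, 211, 214,
    216, 218, 222, 222, 224, 227, 234, 234, 234, 236, 236,
    240, 240, 241, 245, 245, 249, 249, 249, 251, 252, 252,
    257, 257, 258, 258, 262, 269, 273, 273, 279, 282, 287,
    291, 293, 294, 295, 296, 297, 297, 300, 302, 303, 303,
    307, 308, 318, 320, 325, 330, 358, 358, 359, 363, 376,
    380, 391, 394, 397, 401, 405, 409, 411, 413, 418, 418,
    419, 431, 434, 434, 443, 445, 448, 448, 455, 461, 463,
    468, 471, 476, 482, 482, 483, 484, 485, 494, 494, 497,
    499, 503, 512, 513, 517, 520, 524, 526, 535, 536, 538,
    545, 552, 553, 557, 561, 563, 563, 576, 585, 602, 608,
    614, 614, 616, 619, 627, 633, 635, 637, 649, 674, 676,
    679, 684, 693, 704, 712, 736, 745, 756, 777, 794, 810,
    815, 824, 827, 836, 838, 838, 841, 865, 890, 893, 900,
    916, 936, 945, 966, 984, 991, 1020, 1027, 1052, 1078, 1184,
    1275, 1545,
]

mu = fmean(days)
sigma = pstdev(days)  # population std, the same convention as numpy's default np.std
fitted = NormalDist(mu, sigma)

# Area the fitted normal curve places on impossible (negative) payment times.
below_zero = fitted.cdf(0)
# Probability of 200..400 days according to the fitted normal curve.
normal_200_400 = fitted.cdf(400) - fitted.cdf(200)
# Fraction of the actual invoices that took 200..400 days (assumed inclusive).
empirical_200_400 = sum(200 <= d <= 400 for d in days) / len(days)

print(f"mean = {mu:.3f}, std = {sigma:.3f}")
print(f"normal mass below zero: {below_zero:.3f}")
print(f"normal P(200..400):     {normal_200_400:.3f}")
print(f"empirical P(200..400):  {empirical_200_400:.3f}")
```

The empirical fraction needs no distributional assumption at all, so it is a useful baseline against which to judge any fitted curve.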

So my question is: what kind of distribution should I be looking for to model the probability in this scenario? I'm guessing not a normal distribution. I tried searching "non normal probability distribution", but there seem to be dozens and dozens of functions out there, and I'm not sure whether I can just take the area under any probability curve to determine a probability, as I can with the normal distribution.

1
On BEST ANSWER

I suspect that the data follows a geometric distribution. Indeed, the mean of the given 343 samples is

$$ \mu = \text{mean} \approx 290.117, $$

and the figure below compares the cumulative histogram from the given data (colored orange) and the graph of the CDF of the geometric distribution with mean $\mu$ (colored blue).

[Figure: cumulative histogram]

Comparing the probability histogram with the PMF of the geometric distribution with mean $\mu$ does not look bad either:

[Figure: probability histogram]

Then the probability that the payment time falls between $200$ and $400$ days, predicted using this model, is about $25.1\%$, which is not too far from the value $20.4\%$ obtained from the empirical distribution of the given data.
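The model prediction can be reproduced from the mean alone. Below is a minimal sketch, assuming the geometric distribution is parametrised on $\{0,1,2,\ldots\}$ (so that the mean equals $(1-p)/p$) and that the range 200 to 400 includes both endpoints; the function name is illustrative.

```python
def geometric_range_prob(mean, lo, hi):
    """P(lo <= X <= hi) for a geometric X on {0, 1, 2, ...} with the given mean."""
    p = 1.0 / (1.0 + mean)  # per-day payment probability, from mean = (1 - p) / p
    q = 1.0 - p             # probability that a given day passes without payment
    # P(X >= k) = q**k, so the range probability is a difference of two tails.
    return q**lo - q**(hi + 1)


mu = 290.117  # sample mean quoted above
prob = geometric_range_prob(mu, 200, 400)
print(f"{prob:.3f}")  # prints 0.251
```

With the other common convention (values in $\{1,2,\ldots\}$ and mean $1/p$) the result changes only slightly, since $p$ is very small here.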

Also note that the geometric distributions are the distributions taking values in $\{0,1,2,\ldots\}$ that are characterized by the memoryless property. That is, a geometric distribution is the distribution of the number of failures before the first success when the same trial with two outcomes (success/failure) is performed repeatedly and independently.

That being said, this observation seems to suggest that people tend to "successfully remember and pay the invoices" with a certain probability each day, independently of all the other days.

Of course, take my claim with a grain of salt. I am not an expert in statistics, so perhaps people on Cross Validated Stack Exchange can provide better help on this matter!

1
On

As you write, a normal distribution -- and in particular, the normal distribution with your data's mean and variance -- covers negative values, which may not make sense in a given situation. That said, it is often close enough for us to ignore such issues: in many cases the probability of obtaining a negative value under the normal distribution is small enough that we needn't let the possibility trouble us. See, for example, the normal approximations of the binomial, Poisson or hypergeometric distributions.

That said, while the normal distribution comes up all over the place, there may not be any reason to assume a priori that your data has that distribution. Your problem is one of a waiting time. One distribution used to model waiting time is the exponential distribution. (There are other possibilities; the field of survival analysis deals with modeling the time until an event takes place.) The exponential distribution, by the way, is the continuous distribution with the memorylessness property referred to in Sangchul's answer above.
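For an exponential waiting time the range probability has a closed form, so the original 200-to-400-day question can be answered directly. This is a minimal sketch, assuming the rate is estimated as the reciprocal of the sample mean (about 290.117 days, the figure reported in the other answer); the function name is illustrative.

```python
import math


def exponential_range_prob(mean, lo, hi):
    """P(lo <= T <= hi) for an exponential waiting time T with the given mean."""
    rate = 1.0 / mean
    # The CDF is 1 - exp(-rate * t), so the range probability is
    # the difference of the two survival terms.
    return math.exp(-rate * lo) - math.exp(-rate * hi)


prob = exponential_range_prob(290.117, 200, 400)
print(f"{prob:.3f}")  # prints 0.250
```

This is very close to the geometric result, as expected: the exponential distribution is the continuous counterpart of the geometric one.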

In any case, the key here is to not make any assumptions; to have suspicions of an appropriate distribution is okay -- and as you learn more about what's used where, you'll probably find that your suspicions are correct -- but then we have to test those suspicions. Start by looking at a histogram of your data; compare it to the pdf of a distribution you think might work. If you want to be rigorous, once you think you've found a match, perform a chi-square goodness-of-fit test on your data. This will allow you to quantify just how well your data fits the model you think is a match.
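A chi-square goodness-of-fit test against an exponential model could be sketched as follows, using only the standard library. Everything here is an assumption made for illustration: the helper names, the choice of 10 equal-probability bins, and above all the data, which is a synthetic sample rather than the invoice list. Because the mean is estimated from the data, the degrees of freedom are taken as bins minus 2, and the resulting p-value should be read as approximate.

```python
import math
import random


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)**i / i! incrementally
        total += term
    return math.exp(-half) * total


def exponential_gof(data, n_bins=10):
    """Chi-square goodness-of-fit of data against an exponential with the sample mean."""
    n = len(data)
    mean = sum(data) / n
    observed = [0] * n_bins
    for x in data:
        cdf = 1.0 - math.exp(-x / mean)  # fitted exponential CDF at x
        # Equal-probability bins: bin k covers CDF values in [k/n_bins, (k+1)/n_bins).
        observed[min(int(cdf * n_bins), n_bins - 1)] += 1
    expected = n / n_bins  # the same expected count in every bin
    stat = sum((o - expected) ** 2 / expected for o in observed)
    df = n_bins - 2  # bins - 1, minus one estimated parameter (the mean)
    return stat, df, chi2_sf_even_df(stat, df)


# Synthetic stand-in data (NOT the invoice list): 343 exponential draws with mean 290.
random.seed(0)
sample = [random.expovariate(1 / 290) for _ in range(343)]
stat, df, p_value = exponential_gof(sample)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.3f}")
```

A small p-value would be evidence against the exponential model; a large one only means the test found no clear mismatch at this bin resolution.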