Which distribution function for diseases


I am wondering what would be a good method for choosing a suitable probability distribution to fit certain criteria.

For example,

I want to choose a suitable distribution, and specify its parameters, to model the following:

"The number of reported cases of pathogenic bacteria such as E. coli in Texas in a given year" (not based on any real statistics)

Well, to me it seems like this would follow something like a normal distribution.

But it would be shifted to the right, i.e., in most years there are, say, around 500 cases, but in some years there are 2000+ and in some years only 20, and those years are much rarer.

That is another issue I have. How could I make a normal curve that is never negative? The fewest cases that could be reported is zero. Also, the number of cases is a discrete number, so could I still model it with a continuous distribution? I am still confused about this. Can anyone help?

Any recommendations or advice? I came to this site because I thought I could get help from people much smarter than myself. It seems like no matter what I do I cannot get any help. If you know of any other sites that could help, please let me know about them.


There are 2 best solutions below

1
On BEST ANSWER

The question is about a method for choosing a probability distribution to model a particular situation. The first basic question is: what is the support of the distribution? Or, phrased differently, for which numbers is the distribution defined?

As the OP pointed out, there are a few reasons why the normal distribution is not the correct distribution to use.

  • It has probability mass on negative values (as the OP already pointed out).

  • It is a distribution over real numbers and not integers.
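Both objections can be made concrete with a small sketch. The mean of 500 comes from the question; the standard deviation of 400 is purely an assumption chosen for illustration, not an estimate from any data.

```python
from statistics import NormalDist

# Assumed parameters for illustration only: mean 500 cases, standard deviation 400.
cases = NormalDist(mu=500, sigma=400)

# Probability the normal model assigns to impossible negative case counts.
p_negative = cases.cdf(0)

# A continuous distribution gives probability zero to any exact integer,
# so "about 500 cases" has to be expressed as an interval around 500.
p_about_500 = cases.cdf(500.5) - cases.cdf(499.5)

print(f"P(X < 0) = {p_negative:.3f}")
print(f"P(499.5 < X < 500.5) = {p_about_500:.5f}")
```

With a spread this wide relative to the mean, a noticeable share of the probability falls below zero. A smaller standard deviation would shrink that share but never remove it.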

There are at least 2 well-known distributions that are defined over the non-negative integers, which is what you need to model the number of reported cases of a disease.

In this case, the Poisson distribution is probably the most relevant since you are modelling "The number of reported cases [...] in a given year." That is, you are modelling a count of events that occur at some rate over a fixed period of time.
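As a minimal sketch of what a Poisson model looks like here, the following computes the Poisson probabilities directly from the formula $P(X = k) = e^{-\lambda}\lambda^k / k!$ and checks the mean and variance numerically. The rate $\lambda = 500$ is taken from the "around 500 cases" in the question and is an assumption, not a fitted value; `poisson_pmf` is a helper name introduced only for this example.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), computed on the log scale to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 500.0  # assumed average number of cases per year (illustrative)

# Evaluate the pmf on 0..2000, which holds essentially all of the probability.
ks = range(0, 2001)
probs = [poisson_pmf(k, lam) for k in ks]

total = sum(probs)
mean = sum(k * p for k, p in zip(ks, probs))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(f"total probability = {total:.6f}")
print(f"mean = {mean:.2f}, variance = {var:.2f}, sd = {math.sqrt(var):.2f}")
```

The support is the non-negative integers, so neither objection to the normal distribution applies. Note that a Poisson distribution has variance equal to its mean, so with $\lambda = 500$ the standard deviation is only about 22; whether that matches the spread in real data is something to check rather than assume.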

4
On

An intro to probability answer:

This part of my answer assumes that this is a question for an intro to probability course. As such, I will not give what I'm sure is the right answer, but only hints.

A normal distribution is not a good distribution for modelling the number of reported cases of a disease, for two reasons. First, the outcome is not continuous, whereas the normal distribution is. Second, the distribution is going to be somewhat right-skewed.

What distributions do you know that are used for modeling counts?

An applied statistics answer:

As a statistician, I will tell you that you typically cannot use some minimal criteria about the data to figure out how it is distributed.

Several parametric distributions have interpretations that link them to the physical world. The simplest is the Bernoulli distribution (a 0-1 outcome with parameter $p = P(X = 1)$). Other classic interpretations: the exponential distribution can be seen as the distribution of the time between events with a "memoryless" property (memoryless $\rightarrow$ if $T$ is the amount of time already spent waiting for an event, the distribution of the remaining waiting time does not depend on $T$); a gamma distribution with an integer shape parameter is the distribution of a sum of independent exponential random variables with a common rate; the Poisson distribution describes the number of events that occur in a fixed period of time if the times between events are independent and exponentially distributed; etc.
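The link between exponential waiting times and Poisson counts can be illustrated with a short simulation. This is only a sketch: the rate of 5 events per unit time, the number of simulated periods, and the helper name `count_events` are all assumptions made for the example.

```python
import random
from statistics import mean, variance

def count_events(rate: float, period: float, rng: random.Random) -> int:
    """Count events in [0, period] when gaps between events are Exponential(rate)."""
    elapsed = rng.expovariate(rate)  # time of the first event
    count = 0
    while elapsed <= period:
        count += 1
        elapsed += rng.expovariate(rate)  # add the next waiting time
    return count

rng = random.Random(0)  # fixed seed so the run is repeatable
rate = 5.0              # assumed events per unit time (illustrative)
counts = [count_events(rate, 1.0, rng) for _ in range(20_000)]

print(f"sample mean     = {mean(counts):.2f}")
print(f"sample variance = {variance(counts):.2f}")
```

If the counts are Poisson with rate 5, both the sample mean and the sample variance should come out close to 5, which is what the simulation is meant to show.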

However, in practice, these strong assumptions very rarely hold! Well, perhaps with the exception of the Bernoulli distribution. The point is that while these distributions would technically describe how the data should be distributed if the observed process satisfied the conditions used to motivate the distribution, in practice the data very rarely follow that exact process.

As such, when conducting a statistical analysis, rather than making such an assumption and accepting it at face value, statisticians will at least perform model checking, where we inspect whether our guiding assumptions fit reality. Alternatively, non-parametric statistics attempts to make inferences from data without assumptions about how the data are distributed.
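One very simple model check for a Poisson assumption, sketched here as an illustration rather than a complete procedure, is to compare the sample variance with the sample mean, since a Poisson distribution has the two equal. The yearly counts below are made up to resemble the question's description (mostly around 500, with rare years near 20 and above 2000); they are not real data, and `dispersion_index` is a name introduced for this example.

```python
from statistics import mean, variance
from typing import Sequence

def dispersion_index(counts: Sequence[int]) -> float:
    """Sample variance divided by sample mean; close to 1 for Poisson-like data."""
    return variance(counts) / mean(counts)

# Hypothetical yearly case counts, invented for illustration only.
yearly_cases = [20, 310, 450, 480, 500, 510, 530, 560, 620, 2100]

index = dispersion_index(yearly_cases)
print(f"mean = {mean(yearly_cases):.1f}, dispersion index = {index:.1f}")
```

An index far above 1, as with these invented numbers, would suggest that the counts vary much more than a single Poisson distribution allows, and that the Poisson assumption should not be accepted at face value.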