I am wondering what would be a good method for choosing a suitable probability distribution to fit given criteria.
For example,
I want to choose a suitable distribution, and specify its parameters, to model the following:
"The number of reported cases of pathogenic bacteria such as E. coli in Texas in a given year" (not based on any real statistics)
Well, to me it seems like this would follow something like a normal distribution.
But it would be shifted to the right, i.e., most years see around 500 cases, but some years have 2000+ and some years only 20, and those years are much rarer.
That leads to another issue I have. How could I make a normal curve that is never negative? The fewest cases that could be reported is zero. Also, the number of cases is a discrete number, so could I still model it with a continuous distribution? I am still confused about this. Can anyone help?
Any recommendations or advice? I came to this site because I thought I could get help from people much smarter than myself. It seems like no matter what I do I cannot get any help. If you know of any other sites that could help, please let me know about them.
The question is about a method for choosing a probability distribution to model a particular situation. The first basic question is what the support of the distribution is. Or, phrased differently: for which numbers is the distribution defined?
As the OP pointed out, there are a few reasons why the normal distribution is not the correct distribution to use:
It has probability mass on negative values (as the OP already pointed out).
It is a distribution over real numbers and not integers.
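To make the first point concrete, here is a minimal sketch of how much probability a normal distribution would place on negative case counts. The mean of 500 comes from the question; the standard deviation of 300 is purely an assumption chosen for illustration, not something the OP specified.

```python
from statistics import NormalDist

# Hypothetical parameters: mean taken from the question's "around 500 cases",
# standard deviation of 300 is an assumed value for illustration only.
normal_model = NormalDist(mu=500, sigma=300)

# Probability that this normal model produces a negative number of cases.
p_negative = normal_model.cdf(0)

# Probability of landing exactly on any single value is zero for a
# continuous distribution, so integer counts only make sense via intervals.
p_near_500 = normal_model.cdf(500.5) - normal_model.cdf(499.5)

print(f"P(cases < 0)         = {p_negative:.4f}")
print(f"P(499.5 < X < 500.5) = {p_near_500:.4f}")
```

With these assumed parameters a few percent of the probability sits below zero, which is impossible for a count.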
There are at least two well-known distributions that are defined over the non-negative integers, which is what you need to model the number of reported cases of a disease:
The Poisson distribution is often used for rate questions (how many events happen per unit of time).
The Binomial distribution is often used for the number of successes (or failures) in a fixed number of trials.
In this case, the Poisson distribution is probably the more relevant of the two, since you are modelling "The number of reported cases [...] in a given year." That is, you are modelling a rate.
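As a rough sketch of what a Poisson model would look like here, the snippet below computes a few probabilities using only the standard library. The rate of 500 cases per year is taken from the question's made-up figure, and the helper name `poisson_pmf` is my own, not part of any library.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), computed in log space to avoid overflow."""
    if k < 0:
        return 0.0  # the support is the non-negative integers
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 500.0  # hypothetical average number of reported cases per year

# For a Poisson distribution the mean and the variance both equal lam.
mean = lam
std_dev = math.sqrt(lam)

# Probability of seeing between 450 and 550 cases (inclusive) in a year.
p_450_550 = sum(poisson_pmf(k, lam) for k in range(450, 551))

print(f"mean = {mean}, standard deviation = {std_dev:.1f}")
print(f"P(X = 500)          = {poisson_pmf(500, lam):.4f}")
print(f"P(450 <= X <= 550)  = {p_450_550:.4f}")
```

Note that the single parameter fixes both the center and the spread: with a rate of 500 the standard deviation is only about 22, so years with 20 or with 2000+ cases would be essentially impossible under this particular model. If that much variation matters, it is worth checking whether a plain Poisson is too narrow for the data.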