What is the influence of the choice of statistical model in a practical project?

I hope this is the right place to ask this question; if it is not, please feel free to migrate it. There is a famous quote along the lines of "all models are wrong, but some are useful". So I was wondering: what is the influence of model selection in a practical project? For instance, suppose we have two models, one of which is statistically more precise than the other. Should the more precise one always be used instead of the less precise one? Many thanks for your time and attention.


There is almost always a choice to be made between a simple model that is easy to handle but may not be exactly correct, and a more complex model that is more nearly correct but less convenient to deal with.

One example is the famous birthday problem. If there are 23 randomly chosen people in a room, there is a little more than a 50-50 chance that at least two of them will have the same birthday. The computation is not difficult if we assume there are only 365 birthdays (ignoring Feb 29) and that they are equally likely. We know neither of these assumptions is exactly correct. (Of course, we also hope the people are really chosen at random: a convention of twins or a meeting of the Scorpio Society would be a disaster.) Through really advanced combinatorial analysis or fairly simple computer simulation, one can include Feb 29 and use actual US birthday frequencies from government records (roughly speaking, summer birthdays are a little more frequent). Then it turns out that the probabilities of birthday coincidences agree with the simple model to about two decimal places, but not exactly. In a country where birthdays are very much more common in some months than others, the results might be considerably different.
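To make the comparison concrete, here is a minimal Python sketch of both approaches. The function names are my own, and the weights are purely illustrative (365 equally likely days plus Feb 29 at one quarter of the weight); they are not the government frequencies mentioned above, which you would have to substitute yourself.

```python
import random

def p_shared_uniform(n, days=365):
    """Exact chance that at least two of n people share a birthday,
    assuming `days` equally likely birthdays."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct

def p_shared_simulated(n, weights, trials=20_000, seed=0):
    """Monte Carlo estimate of the same chance when day i has
    relative frequency weights[i] (weights need not sum to 1)."""
    rng = random.Random(seed)
    days = range(len(weights))
    hits = 0
    for _ in range(trials):
        birthdays = rng.choices(days, weights=weights, k=n)
        if len(set(birthdays)) < n:  # at least one repeated day
            hits += 1
    return hits / trials

# Hypothetical weights: 365 ordinary days, plus Feb 29 at 1/4 weight.
illustrative_weights = [1.0] * 365 + [0.25]

print(p_shared_uniform(23))
print(p_shared_simulated(23, illustrative_weights))
```

The simulated value carries Monte Carlo error, so a difference in the second decimal place only means something if you use many more trials than this.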

Another example is in the modeling of waiting lines (queues). One often assumes that the times between arrivals of customers are distributed according to an exponential distribution. This is a pretty good model for many purposes and it is very convenient because the 'no-memory property' of exponential distributions makes it unnecessary to consider immediate past history when predicting the arrival of the next customer. In some situations the simple exponential model just doesn't work and then the distribution theory can be quite complicated.
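The no-memory property says that P(T > s + t | T > s) = P(T > t): having already waited s tells you nothing about the remaining wait. A rough simulation can check this. The rate and the times s and t below are arbitrary values chosen for illustration, and the function name is my own.

```python
import math
import random

def survival_fractions(rate, s, t, n=200_000, seed=1):
    """Return (unconditional, conditional, exact) estimates of P(T > t)
    for exponential interarrival times T with the given rate.
    'conditional' is the fraction exceeding s + t among those exceeding s."""
    rng = random.Random(seed)
    samples = [rng.expovariate(rate) for _ in range(n)]
    beyond_s = [x for x in samples if x > s]
    unconditional = sum(x > t for x in samples) / n
    conditional = sum(x > s + t for x in beyond_s) / len(beyond_s)
    exact = math.exp(-rate * t)  # survival function of the exponential
    return unconditional, conditional, exact

# Arbitrary illustrative values: rate 1 per minute, s = 1, t = 0.5.
print(survival_fractions(rate=1.0, s=1.0, t=0.5))
```

All three numbers should be close to one another. Running the same check on a non-exponential distribution (uniform waiting times, say) would show the conditional and unconditional fractions drifting apart, which is exactly when the past history starts to matter.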

A third example is that the famous normal distribution (bell curve) is often used to model heights of people, sizes of error in scientific measurements, scores on college entrance exams, and so on. There are sound theoretical reasons why the normal distribution is so often useful. But it is rare that the fit is perfect. (For example, a normal model necessarily includes a very tiny chance of a person with negative height. Easily ignored, but obviously wrong.)
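To see just how tiny that chance is, here is a small sketch that evaluates the normal CDF at zero. The mean of 170 cm and standard deviation of 10 cm are round numbers I made up for illustration, not measured values.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution. erfc is used rather than
    1 + erf so that very small tail probabilities do not round to zero."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# Hypothetical height model: mean 170 cm, standard deviation 10 cm.
p_negative_height = normal_cdf(0.0, mean=170.0, sd=10.0)
print(p_negative_height)
```

Under these assumed parameters, zero is 17 standard deviations below the mean, so the probability is positive but far smaller than anything that could matter in practice.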

Unfortunately, real life is very complicated and it is almost never possible to make an absolutely perfect model of a chance situation. Hence the quote you mentioned about all models being wrong, but some useful. (I believe G.E.P. Box is usually credited with that quote.)