IID variables in statistics and real-life assumptions


IID (independent and identically distributed) random variables are often used in statistics, where a truly random sample is assumed to consist of IID variables. I'm studying the basics of statistics (as a med student), and I am getting worked up by the following.

Let us suppose we are sampling the population to find out the hypertensive status of people (a discrete variable). To draw statistical inferences about the population distribution, we assume that this sample is composed of a set of IID variables, but, to quote my teacher, this "...is a big assumption and hardly ever perfectly true in real life...". But I don't understand how it is not true. The status of one sampled person is not going to affect the status of another, and hence they are bound to be independent. Moreover, the entire population will have some distribution of hypertensive patients, and since this sample is a part of that population, it is bound to follow the probability distribution of the entire population. I fail to understand how this is not the case in real life.

ADDENDUM

Is the designation of IID a contextual designation?

Imagine the population of New York. It has 1000 Black and 1000 Caucasian residents. We are studying hypertension, and let us assume that in reality hypertension depends only on race, such that each of these races, when plotted individually, gives a perfect normal curve, with no familial predispositions, with the mean of the Black distribution being 190 and that of the white distribution being 170. The overall New York plot is hence bimodal. Now look at the following biased studies, both of which are commonly cited as examples of IID breakdowns:

1. Biased sampling. Our hospital, being nearer to the Black community, had only Black patients coming in for the study. Here, our random variables, which are just the hypertension values of the 1st person, the 2nd person and so on, are independent. (Hypertension depends only on race, so irrespective of what the first person's value is, the second person's value will be a random pick from the Black population's curve.) They are also identically distributed, i.e. each follows the same Black normal curve. Now if we generalize our obtained sample statistic (a mean of 190) to the whole of New York, our study would be a false representation, but why is our assumption of IID wrong here? I fail to understand how independence is violated here, maybe because I fail to understand independence in the first place!

2. Wrong modeling. Let us say we have perfectly random sampling. We have equal numbers of Caucasians and Blacks coming in, and we get an average of 180. Here each pick was independent of the previous pick, and all picks followed the same bimodal distribution for New York. But where this example is cited as an IID breakdown, the claim is that if we model the New York population as a normal curve (and not as a bimodal curve), then the Caucasian picks and the Black picks follow different distributions and hence are not IID. But why would our assumptions about the population change in any way which PDF the random variables follow? Am I missing something here?
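The two scenarios above can be simulated as a rough sketch. The means (190 and 170) and the group sizes come from the setup above; the standard deviation `SD`, the sample size of 200, and the helper `draw` are assumptions made only for illustration.

```python
import random
import statistics

random.seed(0)

# Means come from the question; the standard deviation is an assumption,
# chosen small enough that the city-wide curve has two visible peaks.
SD = 5
black = [random.gauss(190, SD) for _ in range(1000)]
white = [random.gauss(170, SD) for _ in range(1000)]
city = black + white


def draw(population, n):
    """Draw n values with replacement, so every draw has the same
    distribution and no draw influences another."""
    return [random.choice(population) for _ in range(n)]


clinic_sample = draw(black, 200)  # scenario 1: only Black patients come in
random_sample = draw(city, 200)   # scenario 2: random sample of the whole city

city_mean = statistics.mean(city)
clinic_mean = statistics.mean(clinic_sample)
random_mean = statistics.mean(random_sample)

print(f"city mean          : {city_mean:.1f}")
print(f"clinic sample mean : {clinic_mean:.1f}")
print(f"random sample mean : {random_mean:.1f}")
```

In this sketch each sample is drawn IID from the population it is actually drawn from. The clinic sample's mean sits near 190 rather than near the city-wide 180, so what it gets wrong is the target population, not the independence of the individual draws.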


There are 2 best solutions below


I guess we can agree on the premise that hypertensive status is affected by, or at least correlated with, factors like age, genetics, diet, sports habits, etc.

Assume that you have a clinic and you measure the BP of each person who comes to your clinic (for whatever reason). Assuming that the measured values are completely independent is unrealistic: your patients are often accompanied by relatives, with whom they share genetic factors, or by spouses and friends, who probably have similar diets, sports habits, etc. Furthermore, the whole population in the area of your clinic probably shares some common factors (perhaps socioeconomic status, which is probably correlated with diet, sports habits, etc.). As such, your sample is not only dependent but probably also biased (assuming that your target population is the whole country or something like that).
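As a rough illustration of the dependence described above, the following sketch compares 40 measurements taken from 10 families of 4 with 40 measurements taken from unrelated people. The model (a shared family effect plus an individual effect) and every number in it are assumptions chosen only for the example.

```python
import random
import statistics

random.seed(1)

# All values are assumed for illustration only.
MEAN, SD_FAMILY, SD_PERSON = 130.0, 8.0, 8.0


def clustered_sample(n_families, family_size):
    """BP values where members of one family share a common effect."""
    values = []
    for _ in range(n_families):
        shared = random.gauss(0, SD_FAMILY)  # genetics, diet, habits in common
        for _ in range(family_size):
            values.append(MEAN + shared + random.gauss(0, SD_PERSON))
    return values


def independent_sample(n):
    """Every person comes from a different family, so nothing is shared."""
    return clustered_sample(n, 1)


def variance_of_sample_mean(draw, reps=2000):
    """Estimate how much the sample mean varies across repeated studies."""
    means = [statistics.mean(draw()) for _ in range(reps)]
    return statistics.variance(means)


clustered_var = variance_of_sample_mean(lambda: clustered_sample(10, 4))
iid_var = variance_of_sample_mean(lambda: independent_sample(40))

print(f"variance of the mean, 10 families of 4 : {clustered_var:.2f}")
print(f"variance of the mean, 40 unrelated     : {iid_var:.2f}")
```

Under these assumed parameters the theoretical variances are 64/10 + 64/40 = 8.0 for the family sample and 128/40 = 3.2 for the unrelated sample. An analysis that treats the family sample as IID would therefore understate the uncertainty of its mean.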

To summarize, you have to plan your sampling strategy and tactics very carefully if you want to rely (seriously) on the IID assumption. Otherwise, you have either to model the interdependence or to acknowledge that the IID assumption is quite simplistic and may jeopardize your statistical analysis.


There are many ways in which the assumption of IID variables can be broken in real-life applications. Let me illustrate how it can break down in your case of hypertension. I'm not a medical doctor or epidemiologist, so I'll make up some assumptions that are at least plausible to a non-medical doctor.

Suppose people who work hard and are financially successful are more likely to have access to elite medical centers where these trials are to take place. There will then be a correlation between being included in the trial and having high hypertension (assuming such people tend toward it), a correlation that would not occur if the trials were performed at all medical centers.

Likewise, high performers may have access to the internet, and hence become aware of solicitations for participation in trials.

Conversely, high performers may be too busy to participate, so they become underrepresented.

A participant may tell his friends that he's in the trials. Then, when a friend is approached by the hospital, that friend may be more likely to participate (because his friend is participating). Thus, if there is any correlation among friends (hard-charging folk tend to have hard-charging friends), there will be a selection bias.

And so on....
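The selection mechanisms listed above can be sketched in a small simulation. Everything numeric here is made up, in the same spirit as the assumptions above: the share of "hard-charging" people, their BP levels, and the participation probabilities are all hypothetical.

```python
import random
import statistics

random.seed(2)

# Hypothetical population: 30% are "hard-charging" and, by assumption,
# have a higher average BP than everyone else.
population = []
for _ in range(20000):
    hard_charging = random.random() < 0.3
    bp = random.gauss(140 if hard_charging else 125, 12)
    population.append((hard_charging, bp))


def participates(hard_charging):
    """Assumed participation rates: hard-charging people are more likely
    to reach the trial (elite centers, internet access, friends)."""
    return random.random() < (0.6 if hard_charging else 0.2)


participants = [bp for hc, bp in population if participates(hc)]

population_mean = statistics.mean([bp for _, bp in population])
participant_mean = statistics.mean(participants)

print(f"population mean  : {population_mean:.1f}")
print(f"participant mean : {participant_mean:.1f}")
```

With these assumed rates the participants' mean comes out higher than the population's, because the chance of ending up in the sample depends on a trait that is tied to BP. Reversing the participation rates (the "too busy to participate" case) would push the bias the other way.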