What are the random variables that constitute i.i.d. samples?


Consider the following passage from Wooldridge (2010):

"For much of this book we adopt a random sampling assumption. More precisely, we assume that (1) a population model has been specified and (2) an independent, identically distributed (i.i.d.) sample can be drawn from the population."

When we refer to a sample as i.i.d., what random variables are we referring to? It seems that there are two possibilities:

  1. Are we saying that each observation is a random variable? For example, do we have a random variable for observation 1, another for observation 2, etc.? In this case, it seems each random variable would be a random vector since our population model has, at the very least, one dependent variable and one covariate.
  2. Are the random variables that constitute the i.i.d. sample the dependent variable and each covariate in our population model? In this case, are the observations for each variable in our population model possible realizations for each random variable?

Source: Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. Second Edition.


There are 2 best solutions below

  1. Yes, each observation is a random variable. If the observation consists of an outcome variable and covariates, then it is a random vector, and the entire sample is a random matrix.

  2. Yes, that's also correct. Each observation's random vector consists of a dependent part and the covariates that predict it, the independent part(s). For example, one observation could have the characteristics wage, education, experience, and married. Wage is the dependent variable, and education, experience, and married are the independent variables. Of course, you could define your model another way and choose a different dependent variable, but generally there is one particular dependent variable that you're trying to predict.
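To make the "random vector" and "random matrix" language concrete, one way to write the sample is sketched below. The symbols $w_i$, $y_i$, $x_{ik}$, $W$, $N$, and $K$ are notation assumed here for illustration; they do not come from the question.

$$
w_i = (y_i, x_{i1}, \dots, x_{iK}), \quad i = 1, \dots, N,
\qquad
W = \begin{pmatrix}
y_1 & x_{11} & \cdots & x_{1K} \\
\vdots & \vdots & & \vdots \\
y_N & x_{N1} & \cdots & x_{NK}
\end{pmatrix}.
$$

Under random sampling, the rows $w_1, \dots, w_N$ are independent and share one joint distribution. The i.i.d. assumption is a statement about rows only: within a row, $y_i$ and the $x_{ik}$ may be (and typically are) dependent on one another.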


I will present two possible scenarios. The first one corresponds to i.i.d. vectors $Z_i = (Y_i, X_i)$. Assume that $X_i$ is the diet type of the $i$th subject, and $Y_i$ is that subject's weight change after a certain time period. If you allocate the subjects randomly to the diet types (assume that there are two types: $0$ for placebo and $1$ for some diet), then you have $n$ independent random vectors $Z_i$ that have the same distribution. Alternatively, assume that the allocation is not random and is determined by some rule (e.g., a function of the initial weight). Then you have $n_1$ i.i.d. random variables of the type $Y_i \mid X_i=0$ and $n_2$ i.i.d. random variables of the type $Y_i \mid X_i=1$, such that $n = n_1 + n_2$. Consequently, the data-generating model for the first case is usually assumed to be something like $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, where $X_i \sim \mathrm{Ber}(p)$ for $i=1,\dots,n$, while for the second case it is $Y_i = \alpha_0 + \alpha_1 x_i + \xi_i$ for $i=1,\dots,n$, where $x_i \in \{0, 1\}$ is fixed rather than random.
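The two scenarios can be mimicked in a small simulation. This is only a sketch under assumptions that are not part of the answer: standard normal errors, a hypothetical allocation rule ("above-median initial weight gets the diet"), errors independent of initial weight, and made-up coefficient values. The function names are illustrative as well.

```python
import numpy as np


def simulate_random_allocation(n, beta0, beta1, p, rng):
    """Scenario 1: X_i ~ Ber(p), so the pairs Z_i = (Y_i, X_i) are i.i.d."""
    x = rng.binomial(1, p, size=n)
    eps = rng.normal(0.0, 1.0, size=n)  # assumed N(0, 1) errors
    y = beta0 + beta1 * x + eps
    return np.column_stack([y, x])  # row i is one realization of Z_i


def simulate_rule_allocation(initial_weight, alpha0, alpha1, rng):
    """Scenario 2: x_i is fixed by a rule; only Y_i given x_i is random."""
    # Hypothetical rule: subjects above the median initial weight get the diet.
    x = (initial_weight > np.median(initial_weight)).astype(int)
    xi = rng.normal(0.0, 1.0, size=x.size)  # assumed N(0, 1) errors
    y = alpha0 + alpha1 * x + xi
    return y, x


def ols_intercept_slope(y, x):
    """Least-squares fit of y on a constant and x; returns (intercept, slope)."""
    design = np.column_stack([np.ones_like(x, dtype=float), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


rng = np.random.default_rng(0)

# Scenario 1: each row of z is a draw of the random vector (Y_i, X_i).
z = simulate_random_allocation(n=5000, beta0=1.0, beta1=-2.0, p=0.5, rng=rng)
b = ols_intercept_slope(z[:, 0], z[:, 1])

# Scenario 2: initial weights are treated as given, so the group sizes
# n_1 and n_2 are determined by the rule rather than by chance.
w0 = rng.normal(80.0, 10.0, size=5000)
y2, x2 = simulate_rule_allocation(w0, alpha0=1.0, alpha1=-2.0, rng=rng)
a = ols_intercept_slope(y2, x2)

print("scenario 1 (intercept, slope):", b)
print("scenario 2 (intercept, slope):", a)
print("scenario 2 group sizes n_1, n_2:", int((x2 == 0).sum()), int((x2 == 1).sum()))
```

In the first scenario the number of treated subjects varies from sample to sample, because $X_i$ is itself random. In the second, rerunning the experiment with the same subjects would reproduce the same $x_i$, and only the outcomes would change.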