A linear regression model can be described as: $$ y = \beta_0 + \beta_1 X + \epsilon $$ where $\epsilon$ is a zero-mean normal error. My question is: is $X$ a random variable? If not, how can we define $$ \mathbb{E}[y\mid X] $$ when $X$ is deterministic? If so, what do we mean by this (found on the Wikipedia page): "we want to find how changing the value of $X$ changes the typical/expected value of $y$"? Are we talking about an instance of the random $X$ or the expected value of $X$? Link to the wiki page: https://en.wikipedia.org/wiki/Regression_analysis
Are dependent variables random?
247 Views. Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail). There are 2 best solutions below.
No, $X$ is not a random variable in normal linear regression analysis.
The setup is as follows. We are given a set of paired data, $$\{ (x_1, y_1), \dots, (x_n , y_n)\}.$$
We view the $x_i$'s as constants. Then, for each $i$, we view $Y_i$ as a normally distributed random variable, with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$.
The $Y_i$'s are assumed to be independent, and the values of $\beta_0, \beta_1$ and $\sigma^2$ are the same for each $i$. Thus, $$\mathbb E[Y_i] = \beta_0 + \beta_1 x_i.$$ (So yes, I agree that writing "$\mathbb E[Y_i \mid x_i]$" instead of "$\mathbb E[Y_i]$" can be a bit misleading! People write "$\mathbb E[Y_i \mid x_i ] $" to indicate that "there is an expression for $\mathbb E[Y_i]$ given solely in terms of the value of $x_i$", but this should NOT be read as saying that $Y_i$ and $x_i$ are both random variables with a joint probability distribution and that $\mathbb E[Y_i \mid x_i]$ is a conditional expectation!)
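The fixed-design setup above is easy to check numerically. The following is a minimal simulation sketch (not part of the original answer; the parameter values are hypothetical) in which the $x_i$'s are held constant across many replications and only the $Y_i$'s are resampled, so the empirical mean of each $Y_i$ approaches $\beta_0 + \beta_1 x_i$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = np.array([1.0, 2.0, 3.0])   # fixed design points, NOT random

# Simulate many replications of Y_i ~ Normal(beta0 + beta1 * x_i, sigma^2).
reps = 200_000
Y = beta0 + beta1 * x + sigma * rng.standard_normal((reps, x.size))

# The empirical mean of each Y_i approaches beta0 + beta1 * x_i.
print(Y.mean(axis=0))      # close to [2.5, 3.0, 3.5]
print(beta0 + beta1 * x)   # exactly [2.5, 3.0, 3.5]
```

Note that nothing about `x` changes between replications; all the randomness lives in the errors, which is exactly the sense in which $\mathbb E[Y_i]$ is an ordinary (unconditional) expectation here.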
In a sense, whenever you find a conditional probability or a conditional expected value or a conditional variance, you are temporarily treating the event of the random variable on which you condition as if it were not random. Suppose $X$ and $Y$ are random variables and you seek $\operatorname{E}(X^2 Y \mid X).$ That conditional expected value is $X^2\operatorname{E}(Y \mid X).$ The factor $X^2$ is pulled out of the expectation operator just as if it had been (for example) the number $3.$
Much statistical data originates in a way that results in pairs $(X_i,Y_i)$ for $i=1,\ldots,n$ in which both components are random, and then linear regression is used. The probabilistic models on which such tests (e.g. whether the correlation is significant) are based often assume $X$ is not random and $Y$ is random. The justification for that is that one is interested in the conditional expected value and conditional variance of $Y$ given $X$.
Sometimes, however, the $X$ values can be chosen by the experimenter, and the $Y$ values are then provided by nature (or by things outside the experimenter's control). The same statistical methods may then be applied, and again one is interested in the distribution of $Y$ given $X$. In that case $X$ genuinely is not random; but in either case there is a justification for treating it as not random.
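A designed-experiment version of this can be sketched as follows (my own illustration; the design levels and parameter values are made up). The experimenter fixes a grid of $x$ values, nature supplies the $y$'s, and ordinary least squares, which treats $x$ as fixed either way, recovers the coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)

# The experimenter chooses the x values (a designed experiment):
# 5 fixed levels, 40 runs at each level.
x = np.repeat(np.arange(1.0, 6.0), 40)

# Nature supplies y; the true parameters below are hypothetical.
beta0, beta1, sigma = 1.0, 0.8, 0.5
y = beta0 + beta1 * x + sigma * rng.standard_normal(x.size)

# Ordinary least squares treats x as fixed; the same formulas apply
# whether x was chosen by design or merely observed.
b1, b0 = np.polyfit(x, y, deg=1)
print(b0, b1)   # estimates near 1.0 and 0.8
```

The fitting step is identical to what one would do with observational $(X_i, Y_i)$ pairs, which is the point of the paragraph above: the inference conditions on the $x$ values either way.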