Linear Regression Function


Some books write linear regression function in the following way:

$$ Y = a + b \times X + u$$

While others write it in the following way:

$$ Y_i = a + b \times X_i + u_i$$

Why is it necessary to use an index? Are the two forms equivalent? Do $Y_i$ and $X_i$ in the second case refer to one particular observation, or are they still random variables?

In the book that uses the second notation, the author writes: $E(Y|X_i)$ is a function of $X_i$, where $X_i$ is a given value of $X$.

How can $E(Y|X_i)$ be a function if its argument is a specific number? Shouldn't the argument be a variable?

Later, the book writes $E(Y_i|X_i)$.

I am really confused by this notation. Can anyone help me sort out this notational puzzle?

BEST ANSWER

Mostly, the first form is just shorthand for the second. Both models describe random variables; the index $i$ labels the individual copies of the "process". For example, let $Y$ be weight and $X$ height. The basic model you can fit is $$ Y = a + bX + \epsilon, $$ i.e., the weight equals an affine function of the height $X$ plus an orthogonal noise term $\epsilon$. Such a model is sometimes called the "data-generating model".

Now suppose you want to talk about $n$ individuals. You assume that the same model describes the relationship between height and weight for each of them, so for $i = 1, \dots, n$, $$ Y_i = a + bX_i + \epsilon_i. $$ Since you are modelling $n$ people, you have $n$ random variables $Y_i$ (or, more formally, $Y_i \mid X_i = x_i$: the $X$s are given and you try to "predict" $\mathbb{E}[Y_i \mid X_i = x_i]$). Formally, $$ \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), $$ hence $$ Y_i \mid X_i = x_i \sim N(a + bx_i, \sigma^2). $$ That is, you assume that the individuals are mutually independent and that all of them share the same $a$, $b$, and $\sigma^2$.
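As a sketch of this data-generating model, here is a small simulation; the parameter values, height range, and sample size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters of the data-generating model.
a, b, sigma = 2.0, 0.5, 1.0
n = 1000

# n independent copies of the model: Y_i = a + b*X_i + eps_i
x = rng.uniform(150, 200, size=n)    # heights X_i, say, in cm
eps = rng.normal(0, sigma, size=n)   # iid N(0, sigma^2) noise
y = a + b * x + eps                  # weights Y_i

# Conditional on X_i = x_i, Y_i ~ N(a + b*x_i, sigma^2):
# the conditional mean a + b*x_i is linear in x_i.
print(y.mean(), (a + b * x).mean())  # close for large n
```

Each row $(x_i, y_i)$ here is one draw from the same model, which is exactly what the index $i$ records.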

Now you get a random sample of $n$ individuals and you "leave" the realm of random variables. That is, you have $n$ pairs $(y_i, x_i)$ of observed numbers, and you fit a model: you estimate $\hat{a}$ and $\hat{b}$, and the observations satisfy $$ y_i = \hat{a} + \hat{b} x_i + e_i, $$ where $e_i = y_i - \hat{y}_i$. Now everything is a number, not a random variable: the $y_i$ and $x_i$ are observed realizations, and $\hat{a}$ and $\hat{b}$ are computed values (usually by OLS).
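A minimal OLS fit matching this notation can be sketched as follows (the simulated data and parameter values are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, sigma, n = 2.0, 0.5, 1.0, 500
x = rng.uniform(150, 200, size=n)
y = a + b * x + rng.normal(0, sigma, size=n)

# OLS estimates via the closed-form simple-regression formulas:
#   b_hat = cov(x, y) / var(x),  a_hat = mean(y) - b_hat * mean(x)
b_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a_hat = y.mean() - b_hat * x.mean()

y_hat = a_hat + b_hat * x
e = y - y_hat                 # residuals e_i = y_i - y_hat_i

# With an intercept in the model, the residuals sum to
# (numerically) zero and are orthogonal to the regressor.
print(e.sum(), (e * x).sum())
```

Note the distinction the answer draws: `b_hat`, `a_hat`, and `e` are computed numbers from one realized sample, not random variables.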

Regarding $\mathbb{E}[Y|X]$ and $\mathbb{E}[Y_i|X_i]$, the same explanation applies: in both cases it is a common "abuse of notation", i.e., writing these instead of $\mathbb{E}[Y_i \mid X_i = x_i]$.

However, $\mathbb{E}[Y|X]$ may also be shorthand for $\mathbb{E}[Y|\sigma(X)]$, where $\sigma(X)$ is the sigma-algebra generated by $X$. Then $\mathbb{E}[Y|\sigma(X)] = h(X)$ is a well-defined, unique random variable, measurable w.r.t. $\sigma(X)$, and not a constant $a + bx_i$. Some details on this point of view can be found in "Expectation and orthogonal projection".
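A small simulation may help with this distinction: each $h(x) = \mathbb{E}[Y \mid X = x]$ is a number, but $h(X)$ is a random variable, and by the tower property its mean equals $\mathbb{E}[Y]$. The two-valued $X$ and the coefficient 3 below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# X takes two values; the model is chosen so that E[Y | X = x] = 3x.
x = rng.integers(0, 2, size=n)              # X in {0, 1}
y = 3.0 * x + rng.normal(0, 1, size=n)

h0 = y[x == 0].mean()   # estimate of E[Y | X = 0], approx 0
h1 = y[x == 1].mean()   # estimate of E[Y | X = 1], approx 3

# h(X) is itself a random variable: it takes value h0 or h1
# depending on X. By the tower property, E[ E[Y|X] ] = E[Y].
e_y_given_x = np.where(x == 0, h0, h1)
print(e_y_given_x.mean(), y.mean())
```

Each conditional mean is a fixed number, yet the object $\mathbb{E}[Y|\sigma(X)]$ varies with the outcome because $X$ does.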

In a nutshell, $$ Y = \mathbb{E}[Y|\sigma(X)] + \epsilon $$ describes an orthogonal (can you show it?) decomposition of $Y$. To get some intuition, take the two extremes. The first is $\sigma(X) = \{ \Omega, \emptyset \}$, i.e., $\sigma(X)$ carries no information relevant to $Y$; then $$ Y = \mathbb{E}[Y|\sigma(X)] + \epsilon = \mathbb{E}[Y] + \epsilon. $$ This corresponds to the case where $X$ is independent of $Y$: the true model is $Y = a + \epsilon$, and no variance of $Y$ is "explained" by $X$. Now consider $\sigma(X) = \sigma(Y)$; then $$ Y = \mathbb{E}[Y|\sigma(X)] + \epsilon = Y + \epsilon, $$ hence $\epsilon = 0$ a.s. Everything that could be known about $Y$ is known, so there is no noise (no "unexplained" variance of $Y$). In any other case, $\mathbb{E}[Y|\sigma(X)]$ carries some incomplete information about $Y$, so the variance of the noise term gets smaller as the information grows. This is what causes the "explained variance" (the estimated $R^2$) to go up as you add more variables.
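The last point, that in-sample $R^2$ never decreases when a regressor is added (even an irrelevant one), can be checked numerically; the data below are simulated for illustration, with `x2` pure noise unrelated to `y`:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# y depends on x1 only; x2 is an irrelevant variable.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, size=n)

def r_squared(X, y):
    """Fit OLS with an intercept by least squares and return R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r1 = r_squared(x1.reshape(-1, 1), y)
r2 = r_squared(np.column_stack([x1, x2]), y)
print(r1, r2)   # r2 >= r1 even though x2 is irrelevant
```

The larger model nests the smaller one, so its residual sum of squares can only go down, and in-sample $R^2$ can only go up.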

Note: the last part involved a lot of "hand-waving"; a rigorous treatment requires detailed definitions of the probability spaces and is usually omitted in introductory regression courses.