Conditional mean of distribution?


In a simple linear regression, the predicted y values are also the “conditional means” at each x value. For each x value, there is a distribution of y values in the population. How exactly do we know each y value on the regression line is the mean of the conditional distribution at that x value?

I’m trying to think of this in the simplest way possible, with 10 x values and 10 y values. If y on the regression line is 5 when x is 1, then one would say “when x is 1, the mean value of y is 5.” How does the line tell us the “mean” of y when we only have one actual y value to work with?

3 Answers

Answer 1

Assume the random variable $Y$ can be modeled as $Y=\beta_0+\beta_1X_1+\dots+\beta_nX_n+\epsilon$, where $\epsilon\sim N(0,1)$ is the random error term and the $X_i$ are random variables.

After estimating the parameters $\beta_0,\beta_1,\dots,\beta_n$ via least squares, we treat the random variable $Y$ as $\beta_0+\beta_1X_1+\dots+\beta_nX_n+\epsilon$ with the $\beta_i$ filled in as actual numbers.

Then given $X_1=x_1,\dots,X_n=x_n$, the conditional distribution of $Y$ is that of $\beta_0+\beta_1x_1+\dots+\beta_nx_n+\epsilon$. This is normal with mean $\beta_0+\beta_1x_1+\dots+ \beta_nx_n$ and variance $1$. That is, $E(Y|\textbf x)=\beta_0+\beta_1x_1+\dots+ \beta_nx_n$.

But this is exactly the value of the least squares regression line evaluated at $\textbf x$.
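A quick simulated sketch of this point (the coefficients and sample size here are my own assumptions, not from the answer): at a fixed $\textbf x$, the average of the observed $y$ values and the least-squares line evaluated there land on the same number, namely $E(Y|\textbf x)$.

```python
# Simulate Y = beta0 + beta1*X + eps with eps ~ N(0, 1), then compare
# the empirical conditional mean of Y near x = 5 with the fitted line.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 2.0, 3.0            # assumed "true" coefficients
n = 100_000

x = rng.uniform(0, 10, n)
y = beta0 + beta1 * x + rng.normal(0, 1, n)

# Ordinary least squares via the design matrix [1, x]
X = np.column_stack([np.ones(n), x])
b0_hat, b1_hat = np.linalg.lstsq(X, y, rcond=None)[0]

near_5 = np.abs(x - 5) < 0.1       # observations with X close to 5
print(y[near_5].mean())            # empirical conditional mean of Y
print(b0_hat + b1_hat * 5)         # least-squares line at x = 5
```

Both printed values should sit near the true conditional mean $\beta_0+5\beta_1 = 17$.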

Answer 2

Starting from your example: from ten $(x_i, y_i)$ points, the parameters of a two-dimensional normal distribution are inferred in the case of linear regression (see https://en.wikipedia.org/wiki/Multivariate_normal_distribution for details), namely a two-dimensional mean vector and a $2\times 2$ covariance matrix. Then, for $x=1$, the conditional distribution of $y$ can be computed (see https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Bivariate_conditional_expectation), and it can be shown that the mean of this distribution is exactly the value predicted by the linear regression.
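A small sketch of that identity, with a simulated bivariate-normal sample standing in for your ten points (the means and covariance below are assumptions for illustration): the conditional-mean formula $\mu_y + (\sigma_{xy}/\sigma_{xx})(x - \mu_x)$ and the least-squares line are the same expression, so they give the same number at $x=1$.

```python
# Fit the 2-D normal by sample moments, then compare its conditional
# mean at x = 1 with the least-squares prediction at x = 1.
import numpy as np

rng = np.random.default_rng(1)
pts = rng.multivariate_normal(mean=[2.0, 5.0],
                              cov=[[1.0, 0.8], [0.8, 2.0]], size=10)
x, y = pts[:, 0], pts[:, 1]

# Sample moments: the inferred mean vector and covariance entries
mu_x, mu_y = x.mean(), y.mean()
s_xx = np.var(x)                       # variance of x (ddof = 0)
s_xy = np.cov(x, y, bias=True)[0, 1]   # covariance (ddof = 0)

# Bivariate-normal conditional mean of y given x = 1
cond_mean = mu_y + (s_xy / s_xx) * (1.0 - mu_x)

# Least-squares line evaluated at x = 1: algebraically the same
slope = s_xy / s_xx
intercept = mu_y - slope * mu_x
print(cond_mean, intercept + slope * 1.0)
```

Since `intercept + slope * 1.0` expands to exactly `mu_y + slope * (1.0 - mu_x)`, the two printed numbers agree for any data, not just normal samples.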

Answer 3

How exactly do we know each y value on the regression line is the mean of the conditional distribution at that x value?

To understand what's going on here, I think it's important to separate the theoretical probabilistic model from the parameter-fitting algorithm based on actual observed data. On the one hand, we have the linear model, which is an abstract mathematical structure, and on the other hand, we have the regression line that has been calculated numerically from data, using, say, ordinary least squares. So I will break the explanation into these two parts.

The linear model

In this section, we'll go over the definition of a linear model. First, we assume we have random variables $Y, X_1, X_2, \ldots, X_p, \epsilon$ on the same sample space $\Omega$. This means in particular that these are all functions from the same set $\Omega$ to the real numbers. That is,

$Y, X_1, X_2, \ldots, X_p, \epsilon : \Omega \rightarrow \mathbb{R}$.

Now suppose there are constants $\beta_0, \beta_1, \beta_2, \ldots, \beta_p \in \mathbb{R}$ such that

  1. $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$, and

  2. $E[Y \mid X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$.

Then we say that there is a linear model of $Y$ in terms of $X_1, X_2, \ldots, X_p$, and we refer to the formula in criterion (1) as the linear model itself. The $\beta_i$ are called the coefficients or parameters of the model.

Note that both criteria (1) and (2) are definitions, and both are needed for the definition of a linear model. That is, there is no derivation of them. However, we can explain the intuition behind them as follows. Criterion (2) says that for each fixed set of values $x_i$ of the variables $X_i$, the value of $Y$ will be, on average, a linear combination of the $x_i$ (plus a constant $\beta_0$), hence the term linear model. Note that we are only enforcing this strict linear relationship on average. We are allowing for the possibility that other values of $Y$ can occur, which we express formally through the use of the random variable $\epsilon$ in criterion (1). We call $\epsilon$ the error.

Taking the conditional expected value of both sides of equation (1) gives $E[Y \mid X_1 = x_1, \ldots, X_p = x_p] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + E[\epsilon \mid X_1 = x_1, \ldots, X_p = x_p]$. Comparing this with equation (2), we find that

$E[\epsilon \mid X_1 = x_1, X_2 = x_2, \ldots, X_p = x_p] = 0$.

That is, the error has conditional mean $0$. We have derived this fact from the definition of a linear model.
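This fact is easy to see empirically. In the simulated sketch below (the coefficients, sample size, and window width are assumptions for illustration), the average error among observations whose $X$ falls near a fixed $x$ comes out close to $0$, as the derivation predicts.

```python
# Under the linear model, E[eps | x] = 0: the errors attached to
# observations with X near any fixed x should average out to ~0.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.uniform(0, 4, n)
eps = rng.normal(0, 1, n)          # error term with conditional mean 0
y = 1.0 + 0.5 * x + eps            # a linear model with p = 1

near_2 = np.abs(x - 2.0) < 0.05    # observations with X close to 2
print(eps[near_2].mean())          # close to 0
```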

Estimating the parameters of a linear model using data

Now suppose we have a collection of data $(y_1, y_2, \ldots, y_n)$. The connection to the above section comes by thinking of each observation $y_i \in \mathbb{R}$ as a realization, or observed value, of a random variable $Y$. Upon observing a given $y_i$, we may have also observed other quantities $x_{i1}, x_{i2}, \ldots, x_{ip} \in \mathbb{R}$. Each such $x_{ij}$ is also thought of as the realization of a random variable $X_j$. For example, we could study the population of stars in the Milky Way galaxy, and for a sample of $n$ such stars we could record that star $i$ has apparent brightness $y_i$, distance from the Earth $x_{i1}$, and frequency of emitted light (color) $x_{i2}$. In this case, we would be trying to model apparent brightness $Y$ as a function of distance $X_1$ and color $X_2$.

These variables $Y, X_1, X_2, \ldots, X_p$ we have defined may or may not actually obey the properties of a linear model, defined above, on the given population. If they don't, then we can still write down the corresponding model (as in equation (1) above) -- it just won't line up well with the data.

But suppose they do obey a linear model. Then by definition there are constants $\beta_i \in \mathbb{R}$ serving as the coefficients of this model, but we may not know their true values. A natural question is then: can we use the data itself to determine the $\beta_i$, or at least estimate them closely? This is what a fitting algorithm like ordinary least squares accomplishes.

We can now answer your question, quoted at the beginning. If we use a consistent estimator, such as ordinary least squares, then $\hat{\beta}_i$ converges in probability to $\beta_i$ as the sample size $n$ increases. So with enough data, the estimate $\hat{\beta}_i$ should be close to the true $\beta_i$. And in turn, if the estimates for $\beta_i$ are close to their true values, then the resulting regression function (a line, if $p = 1$) will be close to the true function defined by the right hand side of equation (2), which is by definition (the left hand side of equation (2)) the conditional expected value of $Y$. And for any fixed $\mathbf{x} = (x_1, x_2, \ldots, x_p)$, if there are enough observed $y$ values at this $\mathbf{x}$, then their mean should be close to the conditional expected value of $Y$ at $\mathbf{x}$, thanks to the law of large numbers.
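The consistency claim above can be sketched numerically. In this hypothetical simulation (true coefficients and sample sizes are my own choices), refitting ordinary least squares on ever-larger samples drives the estimates $\hat{\beta}_i$ toward the true $\beta_i$.

```python
# Consistency of OLS: as n grows, the worst-case error
# max_i |beta_hat_i - beta_i| tends to shrink toward 0.
import numpy as np

rng = np.random.default_rng(2)
true_b = np.array([1.0, 2.0])      # assumed true (beta0, beta1)

def ols_fit(n):
    """Simulate n points from the linear model and fit by least squares."""
    x = rng.uniform(-1, 1, n)
    y = true_b[0] + true_b[1] * x + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

errors = [np.abs(ols_fit(n) - true_b).max() for n in (10, 1_000, 100_000)]
print(errors)
```

Any single run is random, so the errors need not shrink monotonically, but the largest sample's error should be small with high probability, which is exactly the hedged "should be" in the paragraph above.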

Note that when I say "should be" in the paragraph above, this means with high probability, but not with certainty. Indeed, this probability might never reach $1$ with a finite sample size, if the population is infinite.

Summary

In summary, assuming a linear model means that you are assuming this is how the data will actually behave. If the population actually does follow the linear model you've proposed, then with enough observations you should see all the properties of the linear model, described above, realized in your actual data.

In particular, for any fixed observed $\mathbf{x} = (x_1, x_2, \ldots, x_p)$, the mean of the corresponding observed $y$ values should be close to the $y$ value on the ordinary least squares regression line. If it isn’t, this means (short of witnessing a very low probability event) that either the population you are studying doesn't follow this model perfectly, or you haven't collected enough data. Both of these shortcomings are probably going to happen in practice, which is why you can still see plenty of disagreements between mean observed $y$ values and regression line $y$ values, even though you are using a theoretical framework in which they agree.