Why is $x_i$ not a random variable in simple linear regression


I have just started studying simple linear regression. This concerns Section 9.1 in the book *Introduction to Probability and Statistics for Engineers and Scientists* by Sheldon Ross, 10th edition. It says:

A simple linear regression model supposes a linear relationship between the mean response and the value of a single independent variable. It can be expressed as $$Y =\alpha+\beta x+e$$ where $x$ is the value of the independent variable, also called the input level, $Y$ is the response, and $e$, representing the random error, is a random variable having mean $0$. Suppose that the responses $Y_i$ corresponding to the input values $x_i , i = 1, . . . , n$ are to be observed and used to estimate $\alpha$ and $\beta$ in a simple linear regression model.

To determine estimators of $\alpha$ and $\beta$ we reason as follows: If $A$ is the estimator of $\alpha$ and $B$ of $\beta$, then the estimator of the response corresponding to the input variable $x_i$ would be $A + Bx_i.$

To specify the distribution of the estimators $A$ and $B$, it is necessary to make additional assumptions about the random errors aside from just assuming that their mean is $0$. The usual approach is to assume that the random errors are independent normal random variables having mean $0$ and variance $\sigma^2$. That is, we suppose that if $Y_i$ is the response corresponding to the input value $x_i$, then $Y_1, \ldots, Y_n$ are independent and $$Y_i\sim\mathcal{N}(\alpha+\beta x_i,\sigma^2).$$

My questions are:

  1. Why are $x_i$ not being considered as independent random variables? Do we not consider $x_i$'s as sample data with an underlying distribution?
  2. Why is the error being called random? Why is it a random variable? What is its domain? And why is it normally distributed?

If you can understand what my confusion is about, can you also please explain using examples.


There are 2 answers below.


The answer to both of your questions is "because that's how we're defining the model". We are assuming that we are able to observe $x_i$ and $y_i$, and that if we know $x_i$ for some record then we are able to predict $y_i$ with some amount of error.

So for the purposes of this model, it doesn't matter whether the $x_i$ are fixed or random, because our predictions are always going to be conditioned on the value of $x_i$ regardless. (If you wanted to model the $x_i$ as a random variable and make some inference about the underlying model of the $y_i$ then that's a separate step.)

As for the error terms $\varepsilon_i$ being i.i.d. normal, again that's just an assumption. The idea is that once we remove the effect of the $x_i$, we want to assume that the remaining error is pure white noise with no structure.
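To make this concrete, here is a minimal simulation sketch (with made-up parameter values; `alpha`, `beta`, and `sigma` below are assumptions, not anything from the book). The $x_i$ are held fixed, all of the randomness enters through the error terms, and the least-squares estimates recover the assumed parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed (non-random) input levels, as if chosen by the experimenter.
x = np.linspace(0, 10, 50)

# Hypothetical true parameters for this sketch.
alpha, beta, sigma = 2.0, 0.5, 1.0

# The only randomness is in the errors e_i ~ N(0, sigma^2),
# so conditional on x_i we have Y_i ~ N(alpha + beta*x_i, sigma^2).
e = rng.normal(0.0, sigma, size=x.shape)
Y = alpha + beta * x + e

# Least-squares estimators A and B from the usual closed-form solution.
B = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean()) ** 2)
A = Y.mean() - B * x.mean()

print(A, B)  # should land near alpha = 2.0 and beta = 0.5
```

Nothing in the fitting step cares whether `x` was designed, measured, or sampled at random; the estimates are computed conditionally on the observed values either way.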

Ultimately this comes down to an old adage - "No model is correct, but some models are useful."


ConMan did an excellent job: as that answer states, what you are asking about is the definition/assumption of a simple linear regression model. In other words, that is just how simple linear regression models are defined (a starting point for analysis).

The goal of simple linear regression is to find the best-fitting line. The independent variable, X, is used to predict the response (dependent) variable, Y.

After a model is fit, the error term is called a residual: the difference between the observed value of the dependent variable, Y, and the predicted (fitted) value, Yhat. X and Y are observed; Yhat is the set of fitted values produced by your regression analysis.

My answers to your questions.

1a. The values of X, the independent variable, can be viewed as independent random variables. By definition, a random variable is a variable whose specific outcome is assumed to arise by chance, or according to some random or stochastic mechanism. Yi, your response, is also a random variable whose distribution depends on the value of Xi; hence Y is dependent on X. Your fitted model is the stochastic mechanism used to derive Yhat. Think of a random variable as being tied to a function that defines these probabilities, i.e., a probability distribution.

1b. The independent variable, X, will have a sample distribution (another definition from statistics). In my experience, the sample distribution matters when using the fitted model to make predictions. For example, suppose you have fitted a model to a certain dataset and now want to use it on new data: you should confirm that the new X value lies within the sample distribution of the original dataset. If the new X value is an outlier by a few standard deviations, the model may return a spurious output. In short, I consider the sample distribution of X for post-fitting uses.
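The check described above can be sketched as follows (the training values, the fitted coefficients `A` and `B`, and the `predict` helper are all hypothetical, chosen just for illustration):

```python
import numpy as np

# Hypothetical training inputs and fitted coefficients.
x_train = np.array([1.0, 2.0, 3.5, 4.0, 6.0, 7.5])
A, B = 2.0, 0.5  # assumed fitted intercept and slope

def predict(x_new):
    """Return Yhat = A + B*x_new, warning when x_new falls outside
    the range of the data the model was fitted on (extrapolation)."""
    lo, hi = x_train.min(), x_train.max()
    if not (lo <= x_new <= hi):
        print(f"warning: x={x_new} is outside [{lo}, {hi}]; "
              "this prediction is an extrapolation")
    return A + B * x_new

print(predict(5.0))   # inside the observed range -> 4.5
print(predict(20.0))  # outside the range -> warning printed first
```

A range check is a crude proxy for "within the sample distribution", but it catches the most common failure mode: feeding the model an X far beyond anything it was fitted on.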

2a. The error term is random because you assume that any raw value of the response equals a linear function of the given value of X, plus or minus some random residual variation (normally distributed noise). Another way of looking at it: the residual is the difference between the dependent variable, Y, and the fitted value, Yhat. When you use the fitted model to make a future prediction from a new input X, you will not know where the actual observation falls until it is realized, and that randomness has a probability distribution centered on the mean response value.

2b. The error term is assumed to be normally distributed. Hold the image of a normal distribution curve in mind while going through the following. For a least-squares fit with an intercept, the residuals sum to 0. Go ahead and pull any dataset you want, run the analysis, and calculate the residual, Y - Yhat (observed minus predicted), for each individual observation. The residuals will sum to 0, and this value 0 is the center of your normal distribution.
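You can verify the sum-to-zero property on any dataset; here is a quick sketch with simulated data (the dataset and parameter values are made up for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Any dataset will do; here a hypothetical one.
x = rng.uniform(0, 10, size=30)
Y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=30)

# Least-squares fit with an intercept.
B = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean()) ** 2)
A = Y.mean() - B * x.mean()

# Residuals: observed minus fitted.
residuals = Y - (A + B * x)
print(np.sum(residuals))  # ~0, up to floating-point rounding
```

The sum is exactly zero in exact arithmetic whenever the model includes an intercept; it is a consequence of the least-squares normal equations, not of the normality assumption itself.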

These are examples you can work through to see how a simple linear regression is run; you can duplicate them in Excel, R, or Python.

I hope this helps. I must also credit Dr. Song at Purdue University for teaching me applied statistics, and Dr. Davies' "Book of R" for helping me understand simple linear regression.