What is the correct format for the formula of linear regression?


Let's say you have $i$ response variables ($y_1$, $y_2$, ... $y_i$), and each of them has the SAME two predictors, $x_1$ and $x_2$. I thought that the formula for the linear regression model for each $y$ would be:

$$ y_i = \beta_0 + \beta_{i,1}x_1 + \beta_{i,2}x_2 + \epsilon_i $$

But based on Wikipedia, the formula looks like it would be:

$$ y_i = \beta_0 + \beta_1x_{i,1} + \beta_{2}x_{i,2} + \epsilon_i$$

Here is a picture from another site:

[image: regression formula from another site, with $p$ predictors per response]

This formula suggests there are $n \times p$ predictor variables, with $p$ unique ones for each response variable.

Why is my formula incorrect? Shouldn't the coefficients change depending on which response variable I am trying to model?


There are 2 best solutions below


In layman's terms, linear regression is about finding the best-fit line for your statistical data distribution, given a set of input variables (predictors) and responses (outputs). In a single-variable model with only one predictor, $y_i = mx_i + c$, where $m$ is the slope and $c$ is the bias (intercept), the best fit is achieved by tweaking $m$ and $c$:

STEP 1: Start with random values for $m$ and $c$.

STEP 2: Compute the response your model gives for the corresponding inputs.

STEP 3: Use a loss function (like MSE, mean squared error) to measure how well your model performed compared to the actual data.

You then return to step 1 with different values for $m$ and $c$, and repeat steps 2 and 3 until you find the values of $m$ and $c$ for which the loss function is minimized.

Notice that within each iteration we keep $m$ and $c$ fixed while varying the value of $x$ and recording the response $y$; across iterations we search for the optimal values of $m$ and $c$ for which our model works best.

I used the simple linear regression equation for ease of understanding.
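The loop above can be sketched in Python. The data here is made up, and gradient descent is used in place of purely random restarts as the rule for choosing the next $m$ and $c$:

```python
import numpy as np

# Hypothetical data roughly following y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)

m, c = 0.0, 0.0              # STEP 1: start with initial values for m and c
lr = 0.01                    # learning rate (step size)
for _ in range(5000):
    y_hat = m * x + c        # STEP 2: the model's response for these inputs
    err = y_hat - y
    mse = np.mean(err**2)    # STEP 3: MSE loss vs. the actual data
    # Move m and c in the direction that reduces the MSE
    m -= lr * 2 * np.mean(err * x)
    c -= lr * 2 * np.mean(err)

print(m, c)  # should end up close to the true slope 2 and intercept 1
```

Each pass evaluates the model, scores it with the loss, and adjusts $m$ and $c$ before trying again, which is exactly the three-step cycle described above.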


Assume that the data-generating process is $$ Y = \beta_0 + \beta_1X + \epsilon, $$ where $X$ is a variable (either random or not) and $\epsilon$ is a random variable. The parameters $\beta_0$ and $\beta_1$ are unknown. Now, you observe $n$ realizations of this process, resulting in $n$ data points $\{(y_i, x_i)\}_{i=1}^n$. Namely, you want to use these $n$ data points to estimate the unknown $\beta_0$ and $\beta_1$. Therefore, you fit $n$ linear equations of the form $$ y_i = \beta_0 + \beta_1x_i,\quad i=1,...,n $$ that you want to solve w.r.t. $\beta_0$ and $\beta_1$. Clearly there is no unique exact solution, hence you use the orthogonal projection of $y$ onto the column space of the design matrix, which results in the OLS estimators.
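This projection can be sketched numerically (with hypothetical data; `np.linalg.lstsq` computes the least-squares solution, i.e. the orthogonal projection of $y$ onto the column space of the design matrix):

```python
import numpy as np

# Hypothetical data from Y = beta0 + beta1*X + eps, with beta0 = 1, beta1 = 2
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 40)

# Design matrix: a column of ones for beta0 and the observed x for beta1.
# The n equations y_i = beta0 + beta1*x_i have no exact solution,
# so we take the least-squares (orthogonal projection) solution.
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # OLS estimates [beta0_hat, beta1_hat], near [1, 2]
```

Note that a single pair $(\hat\beta_0, \hat\beta_1)$ is estimated for all $n$ observations: the coefficients do not vary with $i$.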

Using your logic, you suggest that $\beta_0$ and $\beta_1$ vary by observation. This is a different approach and resembles the random effects model (https://en.wikipedia.org/wiki/Random_effects_model), which is distinct from the classical linear regression problem.