What is the difference between linear regression of $y$ on $x$ and of $x$ on $y$?


I'm plotting the regression line of (GDP $\%$ change, Poverty Rate $\%$) $\to (x, y)$ in Mathematica.

What would it mean if I were to switch the axes, i.e., use (Poverty Rate $\%$, GDP change $\%$) instead?

(GDP change $\%$, Poverty Rate $\%$) $\to$ Regression line: $13.555 - 0.168842x$

(Poverty Rate $\%$, GDP change $\%$) $\to$ Regression line: $0.275437 -0.109956x$

To put it simply, I'm attempting to understand the difference between linear regression of $y$ on $x$ and of $x$ on $y$: not just the difference in slope, but what it actually means.

Thanks!
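For reference, here is a minimal Wolfram Language sketch of how the two fits can be produced with LinearModelFit; the data below are synthetic placeholders, since the actual data behind the numbers above is not shown.

```mathematica
(* Synthetic placeholder data standing in for the actual {GDP % change, poverty rate %} pairs *)
SeedRandom[1];
gdp = RandomVariate[NormalDistribution[2, 1], 50];
poverty = 14 - 0.2 gdp + RandomVariate[NormalDistribution[0, 0.5], 50];
data = Transpose[{gdp, poverty}];

(* Poverty rate regressed on GDP change: y as a function of x *)
LinearModelFit[data, x, x]["BestFit"]

(* GDP change regressed on poverty rate: reverse each pair before fitting *)
LinearModelFit[Reverse /@ data, x, x]["BestFit"]
```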


There are 3 answers below.

Answer 1

IMHO, the "actual meaning" is not a mathematical question. I.e., if you understand the technical aspects of the changes in the coefficients, then anything else is just a kind of philosophy. Namely, in a classical regression analysis you assume that the "real" underlying model explaining the poverty rate ($Y$) in terms of GDP ($X$) is given by $Y = \beta_0 +\beta_1X+\epsilon$. I.e., there is some linear function plus a noise term $\epsilon$, where the assumptions on the noise term determine the best procedure for estimating $\beta_0$ and $\beta_1$. In this case $Y$ is called the dependent variable, whilst $X$ is the independent one. So you can say that you are assuming that the poverty rate depends on the GDP level; hence, by controlling GDP you can alter the poverty rate. In $X = \beta_0 +\beta_1Y+\epsilon$, the reasoning is reversed, i.e., your underlying question is "how does the poverty rate affect GDP?". In both directions you are essentially estimating the linear correlation between $X$ and $Y$. The only difference is in the way you pose the question and how you interpret the results.
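As a rough numerical illustration of this symmetry, here is a sketch on synthetic data (not the data from the question): the correlation is the same whichever variable is treated as dependent, while the two fitted lines differ.

```mathematica
(* Synthetic data: x "drives" y through a linear model plus noise *)
SeedRandom[7];
xs = RandomVariate[NormalDistribution[0, 1], 200];
ys = 2 + 0.5 xs + RandomVariate[NormalDistribution[0, 1], 200];

(* Correlation is symmetric in the two variables *)
{Correlation[xs, ys], Correlation[ys, xs]}

(* The two regressions give different intercepts and slopes *)
LinearModelFit[Transpose[{xs, ys}], t, t]["BestFitParameters"]   (* y on x *)
LinearModelFit[Transpose[{ys, xs}], t, t]["BestFitParameters"]   (* x on y *)
```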

Answer 2

If the least-squares line is $y = ax + b$, then for any value of $x$, the value $ax+b$ is an estimate of the average of all $y$-values for members of the population that have the specified $x$-value.
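A small sketch of this interpretation on synthetic data (not the poster's data): with a few repeated $x$-values, the fitted value at a given $x$ is close to the sample mean of the $y$'s observed at that $x$.

```mathematica
(* Synthetic data with repeated x-values, so group means of y can be formed *)
SeedRandom[3];
xs = RandomChoice[Range[5], 300];
ys = 10 + 2 xs + RandomVariate[NormalDistribution[0, 1], 300];
lm = LinearModelFit[Transpose[{xs, ys}], x, x];

lm[3]                  (* fitted value a*3 + b *)
Mean[Pick[ys, xs, 3]]  (* average of the y's whose x-value is 3 *)
```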

Answer 3

The covariance is $s_{xy} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{n-1},$ where the sum is taken over $i = 1, \dots, n$ and $n$ is the sample size. Then the correlation is $r_{xy} = \frac{s_{xy}}{s_x s_y},$ where $s_x$ and $s_y$ are the two standard deviations.
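A quick check of these definitions against the built-in Covariance and Correlation functions, sketched on synthetic data:

```mathematica
(* Compare the textbook formulas with the built-in Covariance and Correlation *)
SeedRandom[5];
xs = RandomVariate[NormalDistribution[0, 2], 100];
ys = RandomVariate[NormalDistribution[0, 3], 100];
n = Length[xs];

sxy = Total[(xs - Mean[xs]) (ys - Mean[ys])]/(n - 1);
{sxy, Covariance[xs, ys]}                                                 (* the same number *)
{sxy/(StandardDeviation[xs] StandardDeviation[ys]), Correlation[xs, ys]}  (* r, computed two ways *)
```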

If you have the regression line $y = 13.555 - 0.168842x$, then you might say (over the interval of your $x$ data) that for each unit of increase in $x$ there is a decrease of about $0.17$ units of $y$. In your second regression line, I think you intend to have $y$ (not $x$) at the end, i.e., $x = 0.275437 - 0.109956\,y$. That equation expresses an increase of about $0.11$ units of $x$ for each unit of decrease in $y$. (However, it is customary to put $x$ on the horizontal axis and $y$ on the vertical axis.)

For the regression of $y$ on $x$ (with the $y$'s on the vertical axis, to be predicted from the $x$'s), the estimated slope is $\hat \beta_1 = s_{xy}/s_x^2 = r\,s_y/s_x,$ so its units are those of $y$ per unit of $x$.

For the regression of $x$ on $y$ ($x$ on the vertical axis, a 'nonstandard' situation), the estimated slope is $\hat \beta_1^\prime = s_{xy}/s_y^2 = r\,s_x/s_y,$ so its units are those of $x$ per unit of $y$. Notice that the two slopes are tied together by $\hat \beta_1 \hat \beta_1^\prime = r^2.$

Traditional statistical tests of the null hypotheses $\rho = 0$, $\beta_1 = 0$, and $\beta_1^\prime = 0$ (based on $r$, $\hat \beta_1$, and $\hat \beta_1^\prime$, respectively) are mathematically equivalent.
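A numerical check of these relations, sketched on synthetic data: the two fitted slopes come out as $s_{xy}/s_x^2$ and $s_{xy}/s_y^2$, their product equals $r^2$, and the p-values for the two slope tests coincide.

```mathematica
(* Verify the slope formulas, the product relation, and the equivalence of the tests *)
SeedRandom[11];
xs = RandomVariate[NormalDistribution[0, 1], 150];
ys = 1 + 0.4 xs + RandomVariate[NormalDistribution[0, 1], 150];

sxy = Covariance[xs, ys]; sx = StandardDeviation[xs]; sy = StandardDeviation[ys];
lmYX = LinearModelFit[Transpose[{xs, ys}], t, t];   (* regression of y on x *)
lmXY = LinearModelFit[Transpose[{ys, xs}], t, t];   (* regression of x on y *)

slopeYX = lmYX["BestFitParameters"][[2]];
slopeXY = lmXY["BestFitParameters"][[2]];

{slopeYX, sxy/sx^2}                             (* slope of y on x = s_xy/s_x^2 *)
{slopeXY, sxy/sy^2}                             (* slope of x on y = s_xy/s_y^2 *)
{slopeYX slopeXY, Correlation[xs, ys]^2}        (* product of slopes = r^2 *)
{lmYX["ParameterPValues"][[2]], lmXY["ParameterPValues"][[2]]}  (* identical slope p-values *)
```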

Notes: This is intended to expand on the theme of @MichaelHardy's answer. There is no need to bring a causal link between $x$ and $y$ into a 'philosophical' discussion.