Does correlation always imply proportion of variance of variable $y$ explained by variable $x$?


Suppose you have the area of a house on the $x$-axis and its price on the $y$-axis. When you compute a regression model that predicts the housing price from its area, the square of the correlation coefficient

$$r^2 = 1 - \frac{SSE}{\sum_i (y_i - \bar{y})^2}$$

(where $SSE$ is the sum of squared errors) corresponds to the proportion of variance in the housing prices that is "explained" by your model. This much makes sense.
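To see this identity numerically, here is a small pure-Python sketch. The data are made-up illustrative numbers, not from the question; it fits the least-squares line, computes $1 - SSE/SST$, and compares it with the squared Pearson correlation:

```python
import math

# Hypothetical data: house areas and prices (illustrative numbers only).
area  = [50, 70, 80, 100, 120, 150]
price = [150, 200, 210, 280, 310, 400]

n = len(area)
mx = sum(area) / n
my = sum(price) / n

# Least-squares slope and intercept for price ~ area.
sxy = sum((x - mx) * (y - my) for x, y in zip(area, price))
sxx = sum((x - mx) ** 2 for x in area)
b = sxy / sxx
a = my - b * mx

# SSE of the fitted line and the total sum of squares.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(area, price))
sst = sum((y - my) ** 2 for y in price)
r2_from_fit = 1 - sse / sst

# Squared Pearson correlation, computed without the fitted line.
r = sxy / math.sqrt(sxx * sst)

# The two quantities agree (up to floating-point error).
print(r2_from_fit, r ** 2)
```

For simple linear regression this equality is exact, which is why $r^2$ is read as "proportion of variance explained" by the fitted line.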

Now suppose you do NOT compute a regression model - instead, you measure something else, e.g. the number of windows, that just happens to be highly correlated with the housing price. If you put the number of windows on the $x$-axis and the housing price on the $y$-axis, suppose you get something that looks roughly like a straight line. You can still compute the correlation coefficient between these two variables. Does the square of this correlation coefficient correspond to the proportion of variance of the housing price that is explained by the number of windows?


This is more of a comment than an answer, but it was too long for the comments section.

Can you please clarify your question?

Regression and the correlation coefficient are measures of linear dependence between variables, nothing more. Whether $X$ "explains" the variance of $Y$, or vice versa, is an interpretation that depends on the context. In your example, why should the number of windows "explain" the price, and not the price the number of windows? If you have a limited budget to buy a house, then your budget determines the number of windows your future house will have, not the other way around.

In other words, the number of windows $X$ and the price $Y$ form a vector of two dependent random variables, $Z = (X, Y)$. This dependence can be measured using, e.g., the linear correlation coefficient. If $Z$ is multivariate normal, then the sample correlation coefficient converges to $\rho$, which is indeed a very informative measure, since the mutual information is a function of it. Intuitively, $\rho$ then contains "everything" that is common to $X$ and $Y$: no more information about $Y$ can be derived from $X$ than is already in $\rho$. Otherwise, when the distribution of $Z$ is unknown, $r_{X,Y}$ or its square is just an "innocent" measure of linear dependence between $X$ and $Y$, and its adequacy, without proper context or theory, may be questionable.
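The point about symmetry can be checked directly: the Pearson correlation treats the two variables interchangeably, so swapping the axes leaves $r$, and hence $r^2$, unchanged. A minimal sketch with made-up numbers (illustrative only, not from the question):

```python
import math

# Hypothetical data: number of windows and prices (illustrative numbers only).
windows = [4, 6, 6, 8, 10, 12]
price   = [150, 200, 210, 280, 310, 400]

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# The formula is symmetric in its arguments, so the "variance explained"
# r^2 is the same whichever variable you put on the x-axis.
r_wp = pearson(windows, price)
r_pw = pearson(price, windows)
print(r_wp ** 2, r_pw ** 2)
```

So the number $r^2$ itself cannot decide which variable "explains" the other; that reading has to come from the context, as argued above.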