Trying to understand errors-in-variables and how this affects the choice of the number of subjects in a study

I am racking my brain over some voluntary exercises we received in the Data Analytics class in my study program. We have a dataset of respondents to an imaginary analysis, and when we used linear regression to look for correlations between the variables, we got an r-squared value that was quite low (down in the 0.3s). What I want to find out, however, is whether it would be better to base the study on a different variable in the same dataset, one that brings in around 10 to 100 times as many respondents.

The exercise is meant to challenge those who want it, and I know that we should look into errors-in-variables models to explain it. The problem is that I struggle to understand how this applies to my problem. After reading about errors-in-variables for a while, I believe I have narrowed it down to how much error you are likely to get and how this affects the dataset as it grows. I think I have a grasp of how errors-in-variables models work and what the linear regression formula looks like in its expanded version (we have been calculating without errors so far), but I want to understand how they affect each other and how errors-in-variables is affected by the number of observations used in the calculation.

TL;DR/specific question: How might the r-squared value of a linear regression model be affected if the number of observations in the dataset increases dramatically, explained using errors-in-variables models? I am not asking whether it is affected, since I know that is not how Stackoverflow works. I want to understand why.

There is 1 best solution below

On BEST ANSWER

More data will change $r^2$ noticeably only if the original number of data points was too low and the estimate was therefore subject to excessive random variability. Otherwise, for the same two variables, it's likely just more of the same and won't change $r^2$ very much.
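
To make this concrete, here is a sketch under the classical errors-in-variables setup (an assumption, since the question does not specify a model; all symbols below are introduced for illustration). Suppose $y = \alpha + \beta x + \varepsilon$, but instead of $x$ you observe $w = x + u$, where the measurement error $u$ is independent of $x$ and $\varepsilon$. As the sample grows, the $r^2$ from regressing $y$ on $w$ settles at

$$\rho_{wy}^2 = \frac{\beta^2\sigma_x^2}{\beta^2\sigma_x^2 + \sigma_\varepsilon^2} \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2}$$

The sample size $n$ does not appear in this expression. Measurement error lowers the value that $r^2$ converges to by the factor $\sigma_x^2 / (\sigma_x^2 + \sigma_u^2)$, and more observations only reduce the sampling noise around that value.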

The scatter about the trend line corresponds roughly to the minor axis of an ellipse that contains all the data points and whose major axis is the trend line. Adding more data points tends to just add more points inside the ellipse without changing the minor axis. That is, the cloud of points becomes denser, but not more concentrated or more spread out about the trend line.
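
A small simulation can illustrate the point about the cloud getting denser without getting tighter. This is only a sketch under assumed values: the function names (`simulate_r2`, `population_r2`), the slope, and the three standard deviations are hypothetical and chosen so that the limiting $r^2$ is low, roughly like the one in the question. The predictor is observed with added measurement error, which is the classical errors-in-variables assumption.

```python
import random


def r_squared(x, y):
    """R^2 of a simple linear regression, i.e. the squared Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def simulate_r2(n, beta=1.0, sd_x=1.0, sd_noise=1.5, sd_meas=0.5, seed=0):
    """Draw n observations and return R^2 of y regressed on the noisy predictor.

    All parameter values are illustrative assumptions, not taken from the question.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_x) for _ in range(n)]
    # The response depends on the true predictor plus ordinary regression noise.
    y = [beta * x + rng.gauss(0.0, sd_noise) for x in x_true]
    # We only get to observe the predictor with measurement error added.
    x_obs = [x + rng.gauss(0.0, sd_meas) for x in x_true]
    return r_squared(x_obs, y)


def population_r2(beta=1.0, sd_x=1.0, sd_noise=1.5, sd_meas=0.5):
    """Value that R^2 converges to as n grows, under the same assumptions."""
    clean = beta**2 * sd_x**2 / (beta**2 * sd_x**2 + sd_noise**2)
    reliability = sd_x**2 / (sd_x**2 + sd_meas**2)
    return clean * reliability


if __name__ == "__main__":
    print(f"limiting R^2: {population_r2():.3f}")
    for n in (30, 300, 3000, 30000):
        values = [simulate_r2(n, seed=s) for s in range(5)]
        print(n, " ".join(f"{v:.3f}" for v in values))
```

With small `n` the five values in a row should differ a lot from each other; with large `n` they should cluster around the limiting value rather than climb toward 1. Setting `sd_meas=0.0` in both functions removes the measurement error and raises the limit, which is the errors-in-variables effect separated from the sample-size effect.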

It is always better to have more data points than fewer.