Generating good data for linear regression

I would like to generate a good, large data set for a linear regression exercise for my students. We will try to explain the salary (a continuous variable) of 50 people in terms of their high school results (out of 20, which we treat as continuous) in 5 subjects, for example maths, physics, chemistry, English and German.

What I mean by a good data set is:

- It satisfies the usual conditions for linear regression.

- It has some logic to it: if you're good at maths, you're likely to be good at physics and not bad at chemistry. If you're extremely good in a few subjects, you're unlikely to fail an exam (<10/20).

- It is not obvious, so students can see the value of linear regression. For instance, if I generated 1 salary vector and 5 result vectors, sorted each one from smallest to largest, and put them all in a matrix, the pattern would be too obvious: we would not need linear regression to see it.

Does a tool exist for generating such data? How can one proceed? Preferably in R or MATLAB.

Best answer

You can start with something like this and then fine-tune the parameters to get the "perfect" data set:

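set.seed(1)  # any seed works; fixed only so the data set is reproducible

# Grades out of 20 for 50 students; each subject depends on the earlier ones
x_1 = round( rnorm(50, 11, 4), 0 )
x_2 = round( 4 + 0.6 * x_1 + rnorm(50, 0, 2), 0 )
x_3 = round( 0.5 * x_1 + 0.5 * x_2 + rnorm(50, 0, 2), 0 )
x_4 = round( x_1 + rnorm(50, 0, 2), 0 )
x_5 = round( - 0.5 * x_1 + 0.5 * x_2 + 0.2 * x_3 + rnorm(50, 10, 3), 0 )

# Clamp the grades (not the response) to the 1-20 range before building y
data1 = data.frame( pmin( pmax( cbind(x_1, x_2, x_3, x_4, x_5), 1 ), 20 ) )

# Response built from the clamped grades; x_5 has no effect by construction
data1$y = with( data1, round( 0.3 * x_1 - 0.3 * x_2 + 0.3 * x_3 + 0.7 * x_4 + rnorm(50, 0, 2), 0 ) )

# Fit the model on the data the students will actually see
m1 = lm( y ~ x_1 + x_2 + x_3 + x_4 + x_5, data = data1 )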
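
As an alternative to chaining the subjects one after another, the grades can be drawn jointly from a multivariate normal with a chosen correlation matrix, which makes the "good at maths, probably good at physics" logic explicit. The following is only a minimal sketch in base R: the correlation values, means, standard deviations, salary coefficients and the names `subjects`, `R`, `beta` and `dat` are all assumptions made for illustration, not values from the question.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n <- 50
subjects <- c("maths", "physics", "chemistry", "english", "german")

# Assumed correlations: sciences strongly related, languages related to each other
R <- matrix(c(
  1.0, 0.7, 0.5, 0.2, 0.2,
  0.7, 1.0, 0.6, 0.2, 0.2,
  0.5, 0.6, 1.0, 0.2, 0.2,
  0.2, 0.2, 0.2, 1.0, 0.6,
  0.2, 0.2, 0.2, 0.6, 1.0
), nrow = 5, byrow = TRUE)

mu    <- c(11, 11, 11, 12, 12)   # assumed mean grades
sds   <- rep(3, 5)               # assumed standard deviations
Sigma <- diag(sds) %*% R %*% diag(sds)

# Correlated normals via the Cholesky factor: rows of Z %*% chol(Sigma)
# have covariance Sigma
Z      <- matrix(rnorm(n * 5), nrow = n)
grades <- sweep(Z %*% chol(Sigma), 2, mu, "+")
grades <- pmin(pmax(round(grades), 0), 20)   # keep grades on the 0-20 scale
colnames(grades) <- subjects

# Hypothetical "true" model: chemistry and german have no effect on salary
beta <- c(maths = 60, physics = 30, chemistry = 0, english = 40, german = 0)
dat  <- as.data.frame(grades)
dat$salary <- 1500 + as.vector(grades %*% beta) + rnorm(n, 0, 150)

fit <- lm(salary ~ ., data = dat)
print(round(cor(dat), 2))   # correlations among grades and with salary
print(summary(fit))         # estimated coefficients vs. the assumed beta
```

Because two coefficients are set to zero and the predictors are correlated, students have to rely on the fitted model rather than on pairwise plots to decide which subjects matter. Calling `plot(fit)` afterwards shows the standard residual diagnostics, which is one way to check the usual regression conditions on the generated data.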