I would like to generate a good, reasonably large data set for a linear regression exercise for my students. We will try to explain the salary (a continuous variable) of 50 people in terms of their high school results (marked out of 20, which we treat as continuous) in 5 subjects, for example maths, physics, chemistry, English, and German.
What I mean by a good data set is:
- It satisfies the usual conditions for linear regression.
- It has some logic to it: if you're good at maths, you're likely to be good at physics and not bad at chemistry. If you're extremely good in a few subjects, you're unlikely to fail an exam (<10/20).
- It is not obvious, so students can see why linear regression matters. For example, if I generate 1 salary vector and 5 result vectors, sort each one from smallest to largest, and put them all in a matrix, the pattern is so obvious that we don't need linear regression to see it.
Does a tool exist for generating such data? How should one proceed? Preferably in R or MATLAB.
You can start with something like this and then fine-tune the parameters to get the "perfect" data set.
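As one possible starting point, here is a minimal sketch in Python (standard library only; the same recipe should carry over to R with `rnorm`). It is not the only way to do this. It assumes that each person has a hidden "ability" score that drives all five marks, which makes maths, physics, and chemistry strongly correlated and the languages less so, and that salary is a linear function of the marks plus Gaussian noise. Every number in it (means, loadings, coefficients, noise level) and the function names `generate_dataset` and `pearson` are made up for illustration and are meant to be tuned.

```python
import math
import random

SUBJECTS = ["maths", "physics", "chemistry", "english", "german"]

# All parameters below are illustrative assumptions, not recommended values.
MEANS = {"maths": 12.0, "physics": 12.0, "chemistry": 12.5, "english": 13.0, "german": 12.0}
LOADINGS = {"maths": 0.9, "physics": 0.85, "chemistry": 0.7, "english": 0.4, "german": 0.4}
BETAS = {"maths": 900.0, "physics": 600.0, "chemistry": 300.0, "english": 500.0, "german": 200.0}
SCORE_SD = 3.0       # spread of marks around each subject mean
INTERCEPT = 15000.0  # baseline salary
NOISE_SD = 2500.0    # larger values make the pattern less obvious


def generate_dataset(n=50, seed=1):
    """Return n rows (dicts) with five correlated marks and a salary."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)  # shared latent factor
        row = {}
        for s in SUBJECTS:
            lam = LOADINGS[s]
            # Standardised mark: part shared ability, part subject-specific.
            z = lam * ability + math.sqrt(1.0 - lam ** 2) * rng.gauss(0.0, 1.0)
            mark = MEANS[s] + SCORE_SD * z
            # Clip to the /20 scale; rare with these settings, but it does
            # make the marks slightly non-normal at the edges.
            row[s] = min(20.0, max(0.0, mark))
        # Linear model with independent, constant-variance Gaussian errors.
        row["salary"] = (INTERCEPT
                         + sum(BETAS[s] * row[s] for s in SUBJECTS)
                         + rng.gauss(0.0, NOISE_SD))
        rows.append(row)
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation, used to sanity-check the marks."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    data = generate_dataset()
    maths = [r["maths"] for r in data]
    physics = [r["physics"] for r in data]
    print("corr(maths, physics) =", round(pearson(maths, physics), 2))
```

Because the rows are generated independently and never sorted, the relationship is not visible just by looking at the table. Raising `NOISE_SD` or shrinking the `BETAS` makes it harder to spot, and moving the `LOADINGS` closer to 1 makes the marks more collinear, which is a useful thing for students to run into.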