In Excel I created some data. The variables are
Time
Length
Width
height
ChocolateQuality
price
The response variable is Price. I used the following formula to generate the values for price:
price = B2*C2*D2*5 + 2*E2 + SIN(A2/3.14)
or
price = Length*Width*height*5 + 2*ChocolateQuality + sin(Time/3.14)
I am puzzled that using linear regression on this data creates such a high Multiple R (0.969034397). The price variable is not linearly related to Length, Width, or height so I would not expect such a good result. Can anyone explain why this works?
I was trying to show that the creation of a new variable, the product of height, width, and length, should improve the regression results (and it does raise it to .99) but I didn't expect the result without the creation of the new variable to be so high. Regarding the input data I used formulas for length =4+3RAND(), Width = =3+2RAND(), height =2+1*RAND(), and ChocoQuality =RANDBETWEEN(1,4). RAND() is an Excel function that generates random numbers betweein -1 and +1. Randbetween generates integers between 1 and 4 (in this case). I created 325 rows of data in this way.
Data looks like this but has 325 rows:
Length Width height ChololateQuality Price
6.584903273 3.724123946 2.729031731 1 336.6201706
6.926510813 4.805111112 2.810795911 1 468.8262491
4.229677111 3.746150844 2.00365992 2 163.3347825
5.621000949 4.869836742 2.190494881 3 306.4677664
6.245666039 3.141578124 2.520717921 2 251.6919095
4.866733045 3.342057324 2.609198462 2 215.3510926
4.687606927 4.779600152 2.6980397 1 304.8411104
The result is not surprising.
A linear approximation of the volume $V=xyz$, given a small variation of $x,y,z$ is:
$$\mathrm dV=\mathrm dx\cdot y\cdot z+x\cdot \mathrm dy\cdot z+x\cdot y\cdot \mathrm dz$$
Here we have variations of $x,y,z$ respectively around the means $6.5, 4.5, 2.5$, and the variations don't exceed 0.5. And we multiply the volume by $5$ in the formula. So we have the approximate relationship:
$$\mathrm{price}\simeq5\mathrm V\simeq K+(5\times 4.5\times 2.5) x +(5\times 6.5\times 2.5) y+(5\times 6.5\times 4.5) z$$
Where the constant $K$ accounts for the mean $V$ as well as the means of $x,y,z$ and other random variables that enter the model.
The regression coefficients should therefore be rather close to $56.25$, $81.25$ and $146.25$ respectively. And it's indeed what is observed on my tests in R. And owing to the near-linear relationship, the $R^2$ is indeed high.