why does linear regression give a good result here?

97 Views Asked by At

In Excel I created some data. The variables are

Time
Length
Width
height
ChocolateQuality
price

The response variable is Price. I used the following formula to generate the values for price:

price =  B2*C2*D2*5  +  2*E2  +  SIN(A2/3.14)
or
price = Length*Width*height*5  +  2*ChocolateQuality  +  sin(Time/3.14)

I am puzzled that using linear regression on this data creates such a high Multiple R (0.969034397). The price variable is not linearly related to Length, Width, or height so I would not expect such a good result. Can anyone explain why this works?

I was trying to show that the creation of a new variable, the product of height, width, and length, should improve the regression results (and it does raise it to .99) but I didn't expect the result without the creation of the new variable to be so high. Regarding the input data I used formulas for length =4+3RAND(), Width = =3+2RAND(), height =2+1*RAND(), and ChocoQuality =RANDBETWEEN(1,4). RAND() is an Excel function that generates random numbers betweein -1 and +1. Randbetween generates integers between 1 and 4 (in this case). I created 325 rows of data in this way.

Data looks like this but has 325 rows:

Length  Width   height  ChololateQuality    Price
6.584903273 3.724123946 2.729031731 1   336.6201706
6.926510813 4.805111112 2.810795911 1   468.8262491
4.229677111 3.746150844 2.00365992  2   163.3347825
5.621000949 4.869836742 2.190494881 3   306.4677664
6.245666039 3.141578124 2.520717921 2   251.6919095
4.866733045 3.342057324 2.609198462 2   215.3510926
4.687606927 4.779600152 2.6980397   1   304.8411104
1

There are 1 best solutions below

8
On BEST ANSWER

The result is not surprising.

A linear approximation of the volume $V=xyz$, given a small variation of $x,y,z$ is:

$$\mathrm dV=\mathrm dx\cdot y\cdot z+x\cdot \mathrm dy\cdot z+x\cdot y\cdot \mathrm dz$$

Here we have variations of $x,y,z$ respectively around the means $6.5, 4.5, 2.5$, and the variations don't exceed 0.5. And we multiply the volume by $5$ in the formula. So we have the approximate relationship:

$$\mathrm{price}\simeq5\mathrm V\simeq K+(5\times 4.5\times 2.5) x +(5\times 6.5\times 2.5) y+(5\times 6.5\times 4.5) z$$

Where the constant $K$ accounts for the mean $V$ as well as the means of $x,y,z$ and other random variables that enter the model.

The regression coefficients should therefore be rather close to $56.25$, $81.25$ and $146.25$ respectively. And it's indeed what is observed on my tests in R. And owing to the near-linear relationship, the $R^2$ is indeed high.