Interpretation of regression formula returned by computer software

I have a dataset consisting of 744 records. Data exploration software generated an equation that I don't know how to interpret in simple terms, and I would really appreciate help with it. The dataset consists of two variables, x and y, with values ranging from 0 to 1. The software gives me two equations:

y = 0.929307*x+normal(0.0305467, 0.0435136)
x = normal(0.4767,0.254105)

Additional information about variables:

mean of y = 0.480593
min of y = -0.351811
max of y = 1.30236
stddev of y = 0.237672
mean of x = 0.483732
min of x = -0.397034
max of x = 1.34522
stddev of x = 0.25316

I don't understand why the minimum and maximum of x and y are less than 0 and greater than 1. The dataset does not include any values lower than 0 or greater than 1. How can this be explained?

There are 1 best solutions below

It looks like your software is (implicitly?) assuming that your data are normally distributed and is then trying to estimate parameters for a pair of normally distributed random variables that would look like your dataset.

Since a normal distribution doesn't have a true maximum and minimum, the software doesn't know that 0 and 1 are supposed to be special values. Instead it's computing "maximum" and "minimum" artificially by adding/subtracting roughly 3.45 standard deviations from the mean. For a normal distribution, the probability of a value falling outside such an interval is about $0.0006$, or roughly $\frac{1}{1800}$.
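As a quick check of the arithmetic, the sketch below works out how many standard deviations the reported "minimum" and "maximum" lie from the reported mean, and the two-sided normal tail probability for a multiple of 3.45. It uses only the numbers quoted in the question; the names `stats`, `z_scores` and `two_sided_tail` are made up for this illustration, and nothing here is known about how the software really derives its bounds.

```python
from statistics import NormalDist

# Summary statistics copied from the question.
stats = {
    "x": {"mean": 0.483732, "sd": 0.25316, "min": -0.397034, "max": 1.34522},
    "y": {"mean": 0.480593, "sd": 0.237672, "min": -0.351811, "max": 1.30236},
}


def z_scores(s):
    """Distance of the reported min and max from the mean, in standard deviations."""
    lower = (s["mean"] - s["min"]) / s["sd"]
    upper = (s["max"] - s["mean"]) / s["sd"]
    return lower, upper


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return 2.0 * (1.0 - NormalDist().cdf(k))


for name, s in stats.items():
    lower, upper = z_scores(s)
    print(f"{name}: min is {lower:.2f} sd below the mean, max is {upper:.2f} sd above")

print(f"P(|Z| > 3.45) = {two_sided_tail(3.45):.6f}")
```

All four multiples come out between about 3.4 and 3.5, which is why 3.45 is only an approximate figure.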

What you should be noticing here is probably that the standard deviation of the noise term, that is, of the residual $Y - 0.929307\cdot X$, is quite small at $0.0435$. That means that your data points fit the line $$ Y = 0.929307\cdot X + 0.0305467$$ quite well (except that those coefficients have too many significant digits to be really meaningful; you should probably just call it $Y=0.93\,X +0.03$).
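To put a number on "quite well", one can take the two fitted equations literally and assume the noise term is independent of $X$ (an assumption; the software's output does not say so). Under that reading the variance of $Y$ splits into a part explained by the line and a part due to noise. The helper names below are hypothetical and the sketch is only illustrative.

```python
import math

# Parameters copied from the two equations in the question.
slope = 0.929307
noise_mean, noise_sd = 0.0305467, 0.0435136
x_mean, x_sd = 0.4767, 0.254105


def implied_y_moments(slope, noise_mean, noise_sd, x_mean, x_sd):
    """Mean and standard deviation of Y = slope*X + noise, assuming X and noise are independent."""
    mean = slope * x_mean + noise_mean
    variance = (slope * x_sd) ** 2 + noise_sd ** 2
    return mean, math.sqrt(variance)


def explained_fraction(slope, noise_sd, x_sd):
    """Share of Var(Y) that comes from the linear term slope*X."""
    signal = (slope * x_sd) ** 2
    return signal / (signal + noise_sd ** 2)


y_mean, y_sd = implied_y_moments(slope, noise_mean, noise_sd, x_mean, x_sd)
r_squared = explained_fraction(slope, noise_sd, x_sd)

print(f"implied mean of y: {y_mean:.4f} (reported 0.480593)")
print(f"implied sd of y:   {y_sd:.4f} (reported 0.237672)")
print(f"fraction of variance explained by the line: {r_squared:.3f}")
```

The implied mean and standard deviation of $Y$ land close to the reported ones, and the line accounts for roughly 97% of the variance of $Y$ under these assumptions.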

A standard deviation of $0.25$ with mean close to $\frac12$ for the $X$ variable might lead one to suspect that $X$ is not really normally distributed but closer to being uniformly distributed in $[0,1]$. (A true uniform distribution on $[0,1]$ would have standard deviation $1/\sqrt{12}\approx 0.289$.) Whether that makes any sense would depend on details about your dataset that you're not showing here, however.
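One rough way to see the tension with normality: if $X$ really followed the fitted normal distribution, a noticeable share of the 744 records would be expected to fall outside $[0,1]$, whereas the question says none do. The sketch below computes that share along with the uniform standard deviation for comparison. It uses only figures from the question; the variable names are illustrative, and this is a plausibility check rather than a formal test.

```python
import math
from statistics import NormalDist

n_records = 744                                  # dataset size from the question
x_model = NormalDist(mu=0.4767, sigma=0.254105)  # fitted distribution for x

# Standard deviation of a uniform distribution on [0, 1].
uniform_sd = math.sqrt(1.0 / 12.0)

# Probability mass the fitted normal places outside [0, 1].
p_outside = x_model.cdf(0.0) + (1.0 - x_model.cdf(1.0))
expected_outside = n_records * p_outside

print(f"uniform sd on [0, 1]: {uniform_sd:.4f} (reported sd of x: 0.25316)")
print(f"P(X outside [0, 1]) under the fitted normal: {p_outside:.3f}")
print(f"expected records outside [0, 1]: {expected_outside:.1f} of {n_records}")
```

The fitted normal puts about 5% of its mass outside $[0,1]$, which would correspond to a few dozen records in a sample of this size.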