Linear Regression Homoskedasticity doubt


I am trying to plot the linear regression of $Y$ on $X$, where $X$ is the regressor/predictor and $Y$ is the response. The data are generated as follows:

```python
import numpy as np

np.random.seed(42)

x = np.random.rand(1, 5000)
y = np.random.rand(1, 5000)

# Keep only the pairs where x < y
fav = np.where(x < y)[1]
x = x[0, fav]
y = y[0, fav]
```

The theoretical regression of $Y$ on $X$ is given by the line $\mathbb{E}(Y|X=x)=(x+1)/2$, and the fitted model gives:

```
coefficient of determination: 0.2512995379407439
intercept: 0.4895868199625202
slope: [0.50666797]
```
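For reference, numbers like these come out of an ordinary least-squares fit; a minimal sketch with scikit-learn (assuming that is the library used, which the question does not state):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(42)
x = np.random.rand(1, 5000)
y = np.random.rand(1, 5000)

# Keep only the pairs where x < y
fav = np.where(x < y)[1]
x = x[0, fav].reshape(-1, 1)  # sklearn expects a 2-D feature matrix
y = y[0, fav]

model = LinearRegression().fit(x, y)
print("coefficient of determination:", model.score(x, y))
print("intercept:", model.intercept_)
print("slope:", model.coef_)
```

The slope and intercept should both land near the theoretical value $0.5$, and the coefficient of determination near $0.25$.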

When I check the assumptions of linear regression, I find that the residual $\epsilon$ is approximately normally distributed, judging by its density plot. [density plot of residuals] However, when I plot the residuals against the predicted values to check for homoskedasticity, I get the following graph. [residuals vs. fitted values plot]

This graph suggests that there is heteroskedasticity, so an assumption is violated. But what does this mean for our model? Should I rely on this EDA alone, or should I also run statistical tests such as the Breusch-Pagan test or Bartlett's test?

There is 1 best solution below


Let's consider the data-generating mechanism: $X$ and $Y$ are drawn independently from a uniform distribution on $(0,1)$, and only pairs with $X<Y$ are kept, so effectively $0<X<Y<1$.

The resulting means are $E(X)=\frac13$ and $E(Y)=\frac23$.

We have $Y|X\sim \text{Uniform}(X,1)$ for $0<X<1$, hence $E(Y|X)=\frac{X+1}2$ and $$Var(Y|X)=\frac{(1-X)^2}{12}.$$
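This formula can be checked with a quick simulation: bin the retained pairs by $x$ and compare the empirical variance of $y$ in each bin with $(1-x)^2/12$ at the bin center (the seed, sample size, and bin edges below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
x = rng.random(200_000)
y = rng.random(200_000)
keep = x < y
x, y = x[keep], y[keep]

# Compare empirical Var(Y | X in bin) with (1 - x)^2 / 12 at the bin center
for lo in (0.1, 0.4, 0.7):
    hi = lo + 0.05
    in_bin = (x >= lo) & (x < hi)
    mid = (lo + hi) / 2
    print(f"x ~ {mid:.3f}: empirical {y[in_bin].var():.5f}, "
          f"theoretical {(1 - mid) ** 2 / 12:.5f}")
```

The empirical and theoretical variances agree to within sampling noise, and both shrink as $x$ approaches 1.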

As we can see, the variance of $Y|X$ is a function of $X$. Here is a plot of the region along with its regression line, plus two lines that are 1 standard deviation away.

[plot of the region $0<X<Y<1$ with the regression line and the two 1-standard-deviation lines]

Heteroskedasticity affects our inference, i.e. our confidence intervals and hypothesis tests. Now that we have diagnosed the problem, what can we do?

Some possible solutions are presented in this book chapter, and a tutorial in R can be found in this R reference.

If the residuals are uncorrelated, a common strategy is weighted least squares. Otherwise, we might resort to generalized least squares or more complex strategies.