I am trying to plot the linear regression between $(X, Y)$, where $X$ is the regressor/predictor and $Y$ is the response. The data are generated as follows:

```python
import numpy as np

np.random.seed(42)
x = np.random.rand(1, 5000)
y = np.random.rand(1, 5000)

# Keep only the indices where x < y
fav = np.where(x < y)[1]
x = x[0, fav]
y = y[0, fav]
```
The true regression line of $Y$ on $X$ is $\mathbb{E}(Y \mid X = x) = (x+1)/2$. Fitting a linear model to this sample gives:

```
coefficient of determination: 0.2512995379407439
intercept: 0.4895868199625202
slope: [0.50666797]
```
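The exact fitting code isn't shown above; one way to reproduce numbers like these is an ordinary least-squares fit with `np.polyfit` (a sketch, using the same seed and filtering as the snippet above):

```python
import numpy as np

np.random.seed(42)
x = np.random.rand(1, 5000)
y = np.random.rand(1, 5000)

# Keep only the pairs with x < y
fav = np.where(x < y)[1]
x, y = x[0, fav], y[0, fav]

# Ordinary least-squares fit of y on x
slope, intercept = np.polyfit(x, y, 1)

# Coefficient of determination R^2 = 1 - SSR/SST
resid = y - (slope * x + intercept)
r2 = 1 - resid.var() / y.var()

print("coefficient of determination:", r2)
print("intercept:", intercept)
print("slope:", slope)
```

Any OLS routine (e.g. scikit-learn's `LinearRegression`) should give the same coefficients up to floating-point noise.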
When I check the assumptions of linear regression, the residual $\epsilon$ appears to be normally distributed, judging from its density plot.
However, when I plot the residuals against the predicted values to check for homoskedasticity, I get the following graph.
This graph indicates heteroskedasticity, so an assumption is violated. But what does this mean for our model? Should I rely on this EDA alone, or should I also run statistical tests such as the Breusch-Pagan test or Bartlett's test?
Let's consider the data-generating mechanism: $X$ and $Y$ are drawn independently from $\mathrm{Uniform}(0,1)$ and only the pairs with $X < Y$ are kept, so effectively $0 < X < Y < 1$.
Under this constraint, $\mathbb{E}(X)=\frac13$ and $\mathbb{E}(Y)=\frac23$.
We have $Y \mid X \sim \mathrm{Uniform}(X, 1)$ for $0 < X < 1$, hence $\mathbb{E}(Y \mid X)=\frac{X+1}{2}$ and $$\mathrm{Var}(Y \mid X)=\frac{(1-X)^2}{12}.$$
As we can see, the variance of $Y \mid X$ is a function of $X$. Here is a plot of the region along with its regression line, plus two lines one conditional standard deviation away.
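These formulas can also be checked by simulation: condition on a thin band of $X$ values around some $x_0$ and compare the empirical mean and variance of $Y$ with $(x_0+1)/2$ and $(1-x_0)^2/12$ (a sketch; the sample size, seed, and band width are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
x = rng.random(n)
y = rng.random(n)

# Keep only the pairs with x < y
keep = x < y
x, y = x[keep], y[keep]

# Condition on X in a thin band around x0 = 0.4
x0 = 0.4
band = np.abs(x - x0) < 0.01

print(y[band].mean())  # close to (x0 + 1)/2 = 0.7
print(y[band].var())   # close to (1 - x0)^2 / 12 = 0.03
```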
It affects our inference, i.e. our confidence intervals and hypothesis tests. Now that we have diagnosed the problem, what can we do?
Some possible solutions are presented in this book chapter, and a tutorial in R can be found in this R reference.
If the residuals are uncorrelated, a common strategy is weighted least squares. Otherwise, we might resort to generalized least squares or more complex strategies.