I'm struggling to decide how to deal with a heteroskedasticity problem in a home price prediction model I'm developing. The training-set residuals are normally distributed around zero, but the mean absolute error is nearly 20%. As expected, variability is lower among observations with low sale values and higher among observations with high sale values.
One thing we are trying to achieve is equity - that is, lower-value homes should be predicted approximately as accurately as higher-value homes. Because our current model doesn't do that well, we're looking for a solution. One idea I thought might help is to run a k-means clustering algorithm on the dependent variable to partition the dataset by home value, giving us three datasets: one each for low, medium, and high values. My first question (which I still have) is whether this algorithm should be applied to the whole dataset before defining training and test sets, or whether we should define training and test sets first, cluster within them, and then create additional training and test sets based on those clusters. Now, though, I'm more concerned about whether there are bias issues we need to worry about, whether this is generally an appropriate way to do this, and whether there are better ways to address the problem.
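To be concrete about the split-then-cluster version of the question, here's a minimal sketch of what I have in mind, assuming scikit-learn's `KMeans` (just an assumption; we haven't committed to a library, and the price distribution is made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Toy sale prices; reshaped to a column because sklearn expects 2-D input
rng = np.random.default_rng(1)
prices = rng.lognormal(mean=12.3, sigma=0.5, size=1000).reshape(-1, 1)

train, test = train_test_split(prices, test_size=0.2, random_state=0)

# Fit k-means on the TRAINING prices only, then assign each test row to
# the nearest learned centroid, so test data never influences the cluster
# boundaries
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train)
train_labels = km.labels_
test_labels = km.predict(test)
```

The alternative (clustering the full dataset before splitting) would just move the `KMeans` fit above the `train_test_split` call; my question is whether that leaks information across the split.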
For context, this is a hedonic model estimated with OLS. We've already used spatial clustering to define geographic areas within a broader study area and partitioned the dataset accordingly (our reasoning for doing this, as opposed to using indicator variables for the geographic clusters, is that the variability in the independent variables across the geographic clusters couldn't be captured by a single binary for each). We also plan to run spatial lag/error models and random forest models to see how they perform, but it's doubtful they'll reduce the absolute error as drastically as we need (we want a mean absolute error under 10%).