Weighted Least Squares - Categorical Data vs. Numerical Data


Over on Stackoverflow, I am trying to calculate the Weighted Least Squares (WLS) fit of a data set using the Python library NumPy and comparing the result to the Statsmodels library. However, I noticed something very mysterious.

I have discovered that computing the WLS on numerical data vs. categorical data yields a completely different line of best fit. In the categorical example of male vs. female, with the values zero (0) and one (1) assigned respectively, I understand it is wrong to compare the dummy variables of the two groups as if they were ordered (e.g. male < female because 0 < 1), since we could have chosen any numbers for these groups, but...

How does the math change? Why does treating everything as numerical data yield a different line of best fit than treating it as categorical data?

As reported here on slide 2, the WLS estimates for the parameters of the line of best fit are found by minimizing the weighted Residual Sum of Squares (RSS):

$$\sum_{i=1}^{n} w_i\left(y_i-[\beta_0+\beta_1 x_i]\right)^2 \\ \text{parameters}=\beta_0,\beta_1 \\ \text{weights}=\vec{w} \\ \text{input}=\vec{x} \\ \text{output}=\vec{y}$$
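For reference, the minimizer of that weighted RSS has the closed form $\hat{\beta}=(X^\top W X)^{-1}X^\top W y$ with $W=\mathrm{diag}(\vec{w})$, which can be checked directly in NumPy (hypothetical data):

```python
import numpy as np

# Closed-form WLS: beta = (X^T W X)^{-1} X^T W y, with W = diag(w).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.2])
w = np.array([1.0, 0.5, 2.0, 1.0])

X = np.column_stack([np.ones_like(x), x])  # columns: beta_0, beta_1
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Sanity check: the gradient of the weighted RSS vanishes at beta.
residuals = y - X @ beta
grad = -2 * X.T @ (w * residuals)
print(beta)   # [beta_0, beta_1]
print(grad)   # should be ~[0, 0]
```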

How does this change with categorical data? I am extremely confused about the math behind including categorical data in the calculations. Does anyone have any thoughts on this?