Weighted Least Squares - Categorical Data vs. Numerical Data


Over on Stackoverflow, I am trying to calculate the Weighted Least Squares (WLS) fit of a data set using the Python library NumPy and comparing the result to the Statsmodels library. However, I noticed something very mysterious.

I have discovered that computing the WLS on numerical data vs. categorical data yields a completely different line of best fit. In the categorical example of male vs. female, with the values zero (0) and one (1) assigned respectively, I understand it is wrong to compare the dummy variables of the two groups as if they were ordered (e.g. male < female because 0 < 1), since we could have chosen any numbers for these groups, but...

How does the math change? Why does treating everything as numerical data yield a different line of best fit than treating it as categorical data?

As reported here on slide 2, the WLS estimates for the parameters of the line of best fit are found by minimizing the weighted Residual Sum of Squares (RSS):

$$\sum_{i=1}^{n} w_i\left(y_i-[\beta_0+\beta_1 x_i]\right)^2 \\ \text{parameters}=\beta_0,\beta_1 \\ \text{weights}=\vec{w} \\ \text{input}=\vec{x} \\ \text{output}=\vec{y}$$
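For reference, the minimizer of that weighted RSS has the closed form $\hat{\beta}=(X^\top W X)^{-1}X^\top W y$ with $W=\mathrm{diag}(\vec{w})$, which can be checked directly in NumPy (hypothetical data):

```python
import numpy as np

# Closed-form WLS: beta = (X^T W X)^{-1} X^T W y, with W = diag(w).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.2])
w = np.array([1.0, 0.5, 2.0, 1.0])

X = np.column_stack([np.ones_like(x), x])  # columns: beta_0, beta_1
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Sanity check: the gradient of the weighted RSS vanishes at beta.
residuals = y - X @ beta
grad = -2 * X.T @ (w * residuals)
print(beta)   # [beta_0, beta_1]
print(grad)   # should be ~[0, 0]
```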

How does this change with categorical data? I am extremely confused about the math behind including categorical data in the calculations. Does anyone have any thoughts on this?