Under what circumstances does the (X′X)^(−1)X′Y equation give bad results when determining the coefficients of a system that is known to be linear?

I am trying to fit a least-squares multiple linear regression (MLR) model to some data using the matrix equation (X′X)^(−1)X′Y, and I noticed that the regression lines my program was coming up with were terrible. To troubleshoot, I replaced the data with something that was perfectly linear (plus a random error component) and found that the calculated coefficients were still way off. So my question is: assuming the data is actually linear, under what circumstances can we no longer use this equation to calculate the coefficients?

If it helps, I have around 1200 data points, with 52 categorical variables. The design matrix is sparse because only 3 variables are used for each outcome. For example, a row of data might be:

[0,0,0,1,0,1,...,0,0,1,0,0] ~ [y1]

I replaced the dependent variable data with perfectly linear data by assigning each variable a numerical value and adding those values together with a random number from 0 to 1 to get the Y for that data point. Any sensible coefficient matrix should just give me back my numerical assignments for each variable, but that doesn't happen.
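A minimal sketch of the synthetic test described above, for anyone who wants to reproduce it. The function name `make_synthetic`, the seed, the range of the hypothetical `true_values`, and the choice of 3 random columns per row are all assumptions for illustration, not the actual program. It also prints the rank of the design matrix once an intercept column is added, which is worth checking in a setup like this.

```python
import numpy as np

def make_synthetic(n_rows=1200, n_vars=52, n_active=3, seed=0):
    """Build a 0/1 design matrix with n_active ones per row and a linear y."""
    rng = np.random.default_rng(seed)
    # Hypothetical numerical value assigned to each variable.
    true_values = rng.uniform(1.0, 10.0, size=n_vars)
    X = np.zeros((n_rows, n_vars))
    for i in range(n_rows):
        # Assumption: each row activates n_active distinct variables at random.
        active = rng.choice(n_vars, size=n_active, replace=False)
        X[i, active] = 1.0
    # y = sum of the active variables' values + uniform noise in [0, 1).
    y = X @ true_values + rng.uniform(0.0, 1.0, size=n_rows)
    return X, y, true_values

X, y, true_values = make_synthetic()
X_aug = np.c_[np.ones((len(X), 1)), X]  # prepend an intercept column
rank = np.linalg.matrix_rank(X_aug)
print("columns:", X_aug.shape[1], "rank:", rank)
```

In this sketch every row sums to 3, so the intercept column is exactly one third of the sum of the indicator columns. When the reported rank is below the column count, X′X is singular and the individual coefficients are not uniquely determined, even though the data is linear.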

Here is a code snippet:

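import numpy as np
from scipy.sparse import csr_matrix

# data, row, col and y come from my dataset (not shown here)
matrix = csr_matrix((data, (row, col))).toarray()
v = np.ones((len(matrix), 1))          # intercept column
matrix = np.c_[v, matrix]
transpose = matrix.transpose()
product = transpose.dot(matrix)        # X'X
inverse = np.linalg.pinv(product)      # pseudo-inverse of X'X
coeff = inverse.dot(transpose).dot(y)  # (X'X)^(-1) X'Y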
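
As a cross-check on the snippet above, the same least-squares problem can be handed to `np.linalg.lstsq`, which also reports the rank of the design matrix. This is only a sketch: `X_demo` and `y_demo` are small made-up arrays standing in for the real data.

```python
import numpy as np

# Hypothetical toy data: 6 rows, 3 indicator columns, exactly one 1 per row.
X_demo = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
], dtype=float)
y_demo = np.array([2.0, 5.0, 7.0, 2.0, 5.0, 7.0])

# Prepend an intercept column, as in the snippet above.
X_aug = np.c_[np.ones((len(X_demo), 1)), X_demo]

# Direct least-squares solve; the third return value is the rank of X_aug.
coeff, residuals, rank, singular_values = np.linalg.lstsq(X_aug, y_demo, rcond=None)
fitted = X_aug @ coeff

print("rank:", rank, "of", X_aug.shape[1], "columns")
print("coefficients:", coeff)
print("fitted values:", fitted)
```

If the reported rank is smaller than the number of columns, the fitted values can still match the data while the individual coefficients differ from the assigned values, because more than one coefficient vector produces the same fit.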

This code has been tested against some regression examples I found online and it works fine for those, which leads me to believe this is a mathematical principle I am not understanding rather than a coding issue.