Using RSS as loss function for linear regression does not work

I wanted to use the RSS as the loss function for a linear regression problem. I know that the MSE is normally used for linear regression, but I wanted to try the RSS and check my understanding of computing the gradients for stochastic gradient descent.
The gradient of the RSS is $$\begin{align} \dfrac{\partial \text{RSS}}{\partial \theta}=-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\theta)\text{,} \end{align}$$ and I use stochastic gradient descent (SGD) to update $\theta$: $$\theta_{i+1} = \theta_{i} - \epsilon\,\dfrac{\partial \text{RSS}}{\partial \theta},$$ where $\epsilon$ is my learning rate.
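As a sanity check on the formula above, the analytic gradient can be compared against a central-difference approximation on a tiny synthetic problem (random data, nothing from the housing set):

```python
import numpy as np

# Verify the RSS gradient -2 X^T (y - X theta) numerically.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=(10, 1))
theta = rng.normal(size=(3, 1))

def rss(t):
    return np.sum(np.square(y - X @ t))

analytic = -2 * X.T @ (y - X @ theta)

# Central differences, one coordinate of theta at a time.
numeric = np.zeros_like(theta)
h = 1e-6
for j in range(theta.size):
    e = np.zeros_like(theta)
    e[j] = h
    numeric[j] = (rss(theta + e) - rss(theta - e)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-4))  # the two gradients agree
```

So the gradient expression itself checks out; the question is why the updates still diverge.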
I have implemented this on a sample dataset from scikit-learn, fetch_california_housing, but for some reason the updates do not decrease the RSS; instead it grows until it reaches infinity. The code is written in Python using NumPy; the main part is the for loop:

import numpy as np 
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
# read the database
housing = fetch_california_housing()

# separate X and y
data = housing['data']
targets = housing.target.reshape(-1,1)

# standardize data 
standardized_housing_data = (data - np.mean(data))/np.std(data)

# divide into train and val set
trainset = standardized_housing_data[:int(0.75*len(standardized_housing_data)),:]
valset = standardized_housing_data[int(0.75*len(standardized_housing_data)):,:]
y_train =  targets[:int(0.75*len(standardized_housing_data)),:]
y_val = targets[int(0.75*len(standardized_housing_data)):,:]



# run SGD with RSS loss function
lr = 0.0001
epochs = 500

m,n = trainset.shape
m_val, n_val = valset.shape

X = np.c_[np.ones((m,1)),trainset]
y_train = y_train.reshape(-1,1)
theta = np.random.uniform(-1,1,(n+1,1))

for i in range(epochs):
    # here is the main part
    y_pred = X@theta                     # compute y
    error = y_train - y_pred             # compute error
    RSS = np.sum(np.square(error))       # compute RSS
    gradients = (-2)*X.T@(error)         # compute gradient of RSS
    theta = theta - lr*gradients         # update theta
    print('Epoch: ', i, ' RSS: ', RSS)

As you can see, I am printing the RSS at each iteration. The output is:

Epoch:  0  RSS:  62335.37853901445
Epoch:  1  RSS:  1601265.8854023307
Epoch:  2  RSS:  1052093835.5520192
Epoch:  3  RSS:  720335492107.2932
Epoch:  4  RSS:  493223770672591.75
Epoch:  5  RSS:  3.377172389891606e+17
Epoch:  6  RSS:  2.312397339932849e+20
Epoch:  7  RSS:  1.5833309172297123e+23
Epoch:  8  RSS:  1.0841289038709424e+26
Epoch:  9  RSS:  7.423182781429204e+28
Epoch:  10  RSS:  5.082757447915687e+31
...
Epoch:  106  RSS:  8.22215327892994e+303
Epoch:  107  RSS:  5.629823762515433e+306
Epoch:  108  RSS:  inf
Epoch:  109  RSS:  inf

Can anyone tell me what the problem is? Am I computing the gradient incorrectly? Is this a numerical problem?
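One thing I noticed while experimenting: the full-batch RSS gradient sums over all m samples, so its magnitude grows with m, and for a quadratic objective gradient descent diverges once the learning rate exceeds 2 divided by the largest eigenvalue of the Hessian (here $2\mathbf{X}^T\mathbf{X}$) — a standard bound. A minimal sketch on synthetic data (not the housing set) showing the same blow-up, and that dividing the gradient by m (i.e. using MSE) makes the identical learning rate stable:

```python
import numpy as np

# Synthetic problem large enough that lr = 1e-4 exceeds the stable range
# for the raw RSS gradient.
rng = np.random.default_rng(1)
m, n = 50000, 8
X = rng.normal(size=(m, n))
y = rng.normal(size=(m, 1))

# Hessian of RSS is 2 X^T X; divergence threshold is 2 / lambda_max.
lam_max = np.linalg.eigvalsh(2 * X.T @ X).max()
lr = 1e-4
print(lr > 2 / lam_max)        # True: this step size cannot converge

# Raw RSS gradient: blows up, just like in the question.
theta = np.zeros((n, 1))
for _ in range(20):
    theta -= lr * (-2) * X.T @ (y - X @ theta)
rss_raw = np.sum(np.square(y - X @ theta))

# Same gradient divided by m (MSE scaling): the Hessian shrinks by m,
# so the same learning rate is now far below the divergence threshold.
theta = np.zeros((n, 1))
for _ in range(20):
    theta -= lr * (-2) * X.T @ (y - X @ theta) / m
rss_scaled = np.sum(np.square(y - X @ theta))

print(rss_raw > rss_scaled)    # the unscaled run has blown up
```

If that reasoning is right, the fix would be either a much smaller learning rate or the 1/m scaling, but I would like confirmation.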