I wanted to use RSS (residual sum of squares) as the loss function for a linear regression problem. I know that MSE is normally used for linear regression, but I wanted to try RSS and check my understanding of computing the gradients for stochastic gradient descent.
The gradient of RSS is
$$\begin{align}
\dfrac{\partial \text{RSS}}{\partial \theta}=-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\theta)\text{.}
\end{align}$$
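One way to verify this formula (this is not part of my original code, just a sanity check on a tiny random problem of my own choosing) is to compare it against a central finite-difference approximation:

```python
import numpy as np

# Tiny synthetic problem, purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=(10, 1))
theta = rng.normal(size=(3, 1))

def rss(theta):
    return np.sum(np.square(y - X @ theta))

# Analytic gradient from the formula above
grad = -2 * X.T @ (y - X @ theta)

# Central finite-difference approximation, one coordinate at a time
eps = 1e-6
num_grad = np.zeros_like(theta)
for j in range(theta.shape[0]):
    e = np.zeros_like(theta)
    e[j] = eps
    num_grad[j] = (rss(theta + e) - rss(theta - e)) / (2 * eps)

print(np.allclose(grad, num_grad, atol=1e-4))
```

Since RSS is quadratic in $\theta$, the central difference should match the analytic gradient essentially to rounding error.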
And I use stochastic gradient descent (SGD) to update $\theta$:
$$\theta_{i+1} = \theta_{i} - \epsilon \cdot \dfrac{\partial \text{RSS}}{\partial \theta}$$ where $\epsilon$ is my learning rate.
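As a sanity check on the update rule itself: for a sufficiently small $\epsilon$, a single gradient step should always reduce RSS. A minimal sketch with synthetic data (sizes, seed, and learning rate are my own choices here, not from my actual code):

```python
import numpy as np

# Toy regression problem, purely illustrative
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ rng.normal(size=(4, 1)) + 0.1 * rng.normal(size=(50, 1))
theta = rng.normal(size=(4, 1))

def rss(theta):
    return np.sum(np.square(y - X @ theta))

lr = 1e-3  # small enough for this toy problem
before = rss(theta)
grad = -2 * X.T @ (y - X @ theta)      # gradient of RSS at theta
theta_next = theta - lr * grad         # one update step
after = rss(theta_next)
print(before, after)                   # RSS should drop after the step
```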
I have implemented this using a sample dataset from scikit-learn: fetch_california_housing. But for some reason the gradient step does not decrease the RSS; instead the RSS grows until it reaches infinity.
The code is written in Python using NumPy. The main part is the for loop:
import numpy as np
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
# read the database
housing = fetch_california_housing()
# separate X and y
data = housing['data']
targets = housing.target.reshape(-1,1)
# standardize data
standardized_housing_data = (data - np.mean(data))/np.std(data)
# divide into train and val sets
trainset = standardized_housing_data[:int(0.75*len(standardized_housing_data)),:]
valset = standardized_housing_data[int(0.75*len(standardized_housing_data)):,:]
y_train = targets[:int(0.75*len(standardized_housing_data)),:]
y_val = targets[int(0.75*len(standardized_housing_data)):,:]
# run SGD with RSS loss function
lr = 0.0001
epochs = 500
m,n = trainset.shape
m_val, n_val = valset.shape
X = np.c_[np.ones((m,1)),trainset]
y_train = y_train.reshape(-1,1)
theta = np.random.uniform(-1,1,(n+1,1))
for i in range(epochs):
    # here is the main part
    y_pred = X @ theta              # compute predictions
    error = y_train - y_pred        # compute error
    RSS = np.sum(np.square(error))  # compute RSS
    gradients = -2 * X.T @ error    # compute gradient of RSS
    theta = theta - lr * gradients  # update theta
    print('Epoch: ', i, ' RSS: ', RSS)
As you can see, I am printing the RSS of each epoch. The output is:
Epoch: 0 RSS: 62335.37853901445
Epoch: 1 RSS: 1601265.8854023307
Epoch: 2 RSS: 1052093835.5520192
Epoch: 3 RSS: 720335492107.2932
Epoch: 4 RSS: 493223770672591.75
Epoch: 5 RSS: 3.377172389891606e+17
Epoch: 6 RSS: 2.312397339932849e+20
Epoch: 7 RSS: 1.5833309172297123e+23
Epoch: 8 RSS: 1.0841289038709424e+26
Epoch: 9 RSS: 7.423182781429204e+28
Epoch: 10 RSS: 5.082757447915687e+31
...
Epoch: 106 RSS: 8.22215327892994e+303
Epoch: 107 RSS: 5.629823762515433e+306
Epoch: 108 RSS: inf
Epoch: 109 RSS: inf
Can anyone tell me what the problem is? Am I computing the gradient incorrectly, or is this a numerical problem?