I am working with linear regression: $$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \approx \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nm} \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} $$ My loss function is:
$$ \frac{1}{2N}\sum_{i=0}^{N-1} \Big(y_i - \big(\textstyle\sum_j x_{ij} a_j + b_j\big)\Big)^2 $$
This function returns a vector of $j$ elements. In my gradient descent algorithm, at each iteration I compute the norm of the vector returned by the loss function. Then, to show that the loss function is being minimized, I plot the norm value against the iteration number. Is this a good idea, or is there a better way to show the minimization of the loss function?
A quick look at my code:
import numpy as np
import matplotlib.pyplot as plt

loss_arr = []

def grad_des_lin(y, x, w, bias, step, epos):
    for u in range(epos):
        # minimize bias and w
        # .....
        # and then calculate the loss for this iteration
        loss_arr.append(np.linalg.norm(los_fun(y, x, w, bias)))

grad_des_lin(y, x, w, bias, step, epos)
plt.plot(range(epos), loss_arr)
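For reference, `los_fun` is not shown in the question; a plausible sketch (an assumption, not the asker's actual code) is the per-sample residual vector of the linear model, whose norm is then taken:

```python
import numpy as np

def los_fun(y, x, w, bias):
    """Hypothetical loss: per-sample residuals y_i - (x_i . w + bias),
    returned as a vector so that a norm can be taken over it."""
    return y - (x @ w + bias)

# Toy check on a tiny system that the model fits exactly.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([0.5, 0.5])
y = x @ w + 1.0                     # generated with bias = 1.0
print(np.linalg.norm(los_fun(y, x, w, 1.0)))  # → 0.0
```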
It doesn't make sense to compute the norm of your loss function for a single input/output pair $(x,y)$. Rather, you should compute the mean of these norms over a set $S$ of input/output pairs.

I suggest taking $S$ to be a fixed, randomly selected subset of your training data. After each step, append mean_los_fun(S, w, bias) to loss_arr. Here is a possible definition of mean_los_fun(S, w, bias):