Could someone elaborate on the vectorized gradient of the MSE cost function? While writing code, I noticed that something seemed wrong with the partial-derivative terms my class was outputting. I used the following formulas:
$$\frac{\partial J(\theta)}{\partial \theta}=\frac{1}{m}X^T(X\theta-y) \in \mathbb{R}^{n+1}$$ $$\theta := \theta-\alpha\,\frac{\partial J(\theta)}{\partial \theta}$$
Below, I have attached my work through the first iteration, where $\theta_0$ increases despite already being at an ideal value ($0$). Could someone explain where my error is: in the assumption that $J(\theta)$ always decreases, or in my math/formulas? Thanks.

The first coordinate need not stay at $0$, and in this case it will not, since the corresponding coordinate of the gradient is non-zero. It will still converge back to $0$ if the step size is chosen carefully.
We have $\theta_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.
$\theta_1 = \theta_0 - \alpha\begin{bmatrix} -\frac52 \\ -\frac{15}2 \end{bmatrix}$
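This first step can be checked numerically. (A sketch assuming the data $x = y = (1, 2, 3, 4)$, which is consistent with the matrices used in this answer; the original data is not shown.)

```python
import numpy as np

# Hypothetical data consistent with this answer's matrices:
# x = y = (1, 2, 3, 4) gives X^T X = [[4, 10], [10, 30]] and X^T y = [10, 30].
x = np.arange(1.0, 5.0)
y = x.copy()
X = np.column_stack([np.ones_like(x), x])  # design matrix with a bias column
m = len(y)

theta0 = np.zeros(2)
grad = X.T @ (X @ theta0 - y) / m  # (1/m) X^T (X theta - y) at theta = 0
print(grad)  # [-2.5 -7.5], i.e. (-5/2, -15/2)
```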
In general, keeping the $\frac1m$ factor from the gradient formula (here $m = 4$),
\begin{align}\theta_{k+1} &= \theta_k -\frac{\alpha}{m} X^T(X\theta_k - y) \\ &=\left(I-\frac{\alpha}{m} X^TX\right)\theta_k + \frac{\alpha}{m} X^Ty \\ &= \begin{bmatrix} 1-\alpha & -\frac{5\alpha}{2} \\ -\frac{5\alpha}{2} & 1-\frac{15\alpha}{2} \end{bmatrix} \theta_k +\alpha \begin{bmatrix} \frac52 \\ \frac{15}2\end{bmatrix}\end{align}
We have to choose the step size carefully so that the iteration converges. Let me pick $\alpha = 0.01$, for which the spectral radius of the iteration matrix is less than $1$.
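The spectral radius claim can be verified numerically. (A sketch assuming the data $x = y = (1, 2, 3, 4)$ and the $\frac1m$-normalized gradient from the question's formula.)

```python
import numpy as np

# Hypothetical data consistent with X^T X = [[4, 10], [10, 30]]
x = np.arange(1.0, 5.0)
X = np.column_stack([np.ones_like(x), x])
m, alpha = 4, 0.01

# Iteration matrix of the affine recursion theta_{k+1} = A theta_k + b
A = np.eye(2) - (alpha / m) * (X.T @ X)
rho = max(abs(np.linalg.eigvalsh(A)))  # A is symmetric
print(rho)  # about 0.9985, which is less than 1, so the iteration converges
```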
Using the following Python code, we can see the progress:
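(The original snippet is not attached here, so below is a minimal reconstruction; it assumes the data $x = y = (1, 2, 3, 4)$, which reproduces the $X^TX$ and $X^Ty$ used above.)

```python
import numpy as np

# Hypothetical data chosen to match the matrices above:
# x = y = (1, 2, 3, 4) gives X^T X = [[4, 10], [10, 30]] and X^T y = [10, 30].
x = np.arange(1.0, 5.0)
y = x.copy()
X = np.column_stack([np.ones_like(x), x])  # design matrix with a bias column
m = len(y)

alpha = 0.01
theta = np.zeros(2)  # start at theta_0 = (0, 0)

for k in range(1, 10001):
    # gradient step: theta <- theta - (alpha/m) X^T (X theta - y)
    theta = theta - alpha * X.T @ (X @ theta - y) / m
    if k in (1, 10, 100, 1000, 10000):
        print(k, theta)
```

The first iterate is $(0.025, 0.075)$, matching the step computed above, and the iterates approach the least-squares solution $(0, 1)$.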
The output shows the first coordinate moving away from $0$ at first and then the iterates converging to the least-squares solution $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$, as claimed.