I am given the following equation: $$RSS(B, \alpha) = \sum_{i=1}^{N} (y_{i} - B^{T}x_{i} - \alpha)^{2} $$
My steps are as follows; I have provided images of my work at the bottom, and this is just a description of my thought process.
- I define $\lambda_{i} = y_{i} - B^{T}x_{i} - \alpha$
- I then notice that $\lambda^{T}\lambda = \lambda_{1}^{2} + \lambda_{2}^{2} + \ldots + \lambda_{N}^{2}$
- Since I defined earlier that $\lambda_{i} = y_{i} - B^{T}x_{i} - \alpha$, I can substitute that back in and I am now able to represent the original summation.
- I then look at the vector $\lambda$ and notice that its elements are: $$\begin{bmatrix}y_{1} - B^{T}x_{1} - \alpha \\ y_{2} - B^{T}x_{2} - \alpha \\ \vdots \\ y_{N} - B^{T}x_{N} - \alpha\end{bmatrix}$$ which can be broken down into three column vectors $Y, B^{T}X,\overline{\alpha}$
- So $\lambda = Y - B^{T}X - \overline{\alpha}$ and I get that: $$RSS(B, \alpha) = (Y - B^{T}X - \overline{\alpha})^{T}(Y - B^{T}X - \overline{\alpha})$$
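The equivalence between the summation form and the vector form can be sanity-checked numerically. The sketch below (my own illustrative example, not from the original work) stacks the $x_i^T$ as rows of a matrix `X`, so the vector of $B^{T}x_{i}$ terms is computed as `X @ b`:

```python
import numpy as np

# Toy data (illustrative): N = 5 samples, p = 3 features.
rng = np.random.default_rng(0)
N, p = 5, 3
X = rng.normal(size=(N, p))   # row i is x_i^T
y = rng.normal(size=N)
b = rng.normal(size=p)        # a candidate B
alpha = 0.7                   # a candidate intercept

# Summation form: RSS = sum_i (y_i - B^T x_i - alpha)^2
rss_sum = sum((y[i] - b @ X[i] - alpha) ** 2 for i in range(N))

# Vector form: lam_i = y_i - B^T x_i - alpha, RSS = lam^T lam
lam = y - X @ b - alpha
rss_vec = lam @ lam

print(np.isclose(rss_sum, rss_vec))  # True
```

Any choice of `b` and `alpha` gives the same value in both forms, which confirms the substitution step.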
- After distributing the transpose and multiplying everything out, I take the partial derivatives with respect to $B$ and $\alpha$ and get the following two equations.
$$\frac{\partial RSS}{\partial B} = X^{T}Y - X^{T}XB - X^{T}\overline{\alpha}$$ $$\frac{\partial RSS}{\partial \alpha} = Y - B^{T}X - \overline{\alpha}$$
but after solving for $B$ and $\alpha$, I get $0 = 0$ and am lost on how to continue from there. I know there is something wrong with my math, but I am having trouble identifying it. I never learned how to take derivatives of matrices in school, so I am basing all my knowledge on http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf
$\def\p{\partial}$ Let's use a naming convention where an uppercase Latin letter denotes a matrix, lowercase Latin a column vector, and lowercase Greek a scalar.
Then define the variables $$\eqalign{ J &= {\tt11}^T &\quad\big({\rm All\,Ones\,Matrix}\big)\\ C &= I-\tfrac 1nJ &\quad\big({\rm Centering\,Matrix}\big) \\ M &= X^+C \\ X &= \big[\,x_1\;x_2\;\ldots\;x_n\big]^T \\ b &= B \\ w &= Xb + \alpha{\tt1} - y &\quad\big({\rm Residual\,Vector}\big) \\ }$$ where $X^+$ is the Moore-Penrose inverse of $X$.
Write the RSS function in terms of these new variables and calculate its differential.
$$\eqalign{ \rho &= w^Tw \\ d\rho &= 2w^Tdw \\ &= 2w^T(X\,db + {\tt1}\,d\alpha) \\ }$$
Holding $b$ constant (so that $db=0$) yields the gradient with respect to $\,\alpha$.
$$\eqalign{ d\rho &= 2w^T{\tt1}\,d\alpha \\ \frac{\p \rho}{\p \alpha} &= 2(w^T{\tt1}) = 2({\tt1}^Tw) \\ &= 2\left({\tt1}^TXb +n\alpha -{\tt1}^Ty\right) \\ }$$
Set this gradient to zero and solve for the optimal $\alpha$.
$$\eqalign{ \alpha &= \tfrac 1n\,{\tt1}^T\big(y-Xb\big) \\ \alpha{\tt1} &= \tfrac 1n\,J\big(y-Xb\big) \;=\; (I-C)\,(y-Xb) \\ y-\alpha{\tt1} &= Cy + (I-C)Xb \\ }$$
Similarly, holding $\alpha$ constant yields the gradient with respect to $b$.
$$\eqalign{ d\rho &= 2w^TX\,db = 2(X^Tw)^Tdb \\ \frac{\p \rho}{\p b} &= 2X^Tw = 2\big(X^TXb +\alpha X^T{\tt1} -X^Ty\big) \\ }$$
Set the gradient to zero and solve for the optimal $b$.
$$\eqalign{ &X^TXb = X^T(y-\alpha{\tt1}) \\ &b = X^+(y-\alpha{\tt1}) \\ &b = X^+\Big(Cy - (C-I)Xb\Big) \\ &\Big(I+X^+(C-I)X\Big)b = X^+Cy \\ &\Big(X^+CX\Big)b = X^+Cy \\ &b = \big(X^+CX\big)^+X^+Cy \\ }$$
The following parameter combinations will be very useful.
$$\eqalign{ Xb &= \big(X^+C\big)^+\big(X^+C\big)y &\doteq\; M^+My \\ \alpha{\tt1} &= (I-C)\,(y-Xb) &\doteq (I-C)(I-M^+M)y \\ }$$
Substituting the optimal parameter values yields
$$\eqalign{ w &= Xb +\color{red}{\alpha{\tt1}} -y \\ &= M^+My +\color{red}{(I-C)y +(C-I)M^+My} -y \\ &= C(M^+M-I)\,y \\ \rho &= w^Tw \\ &= y^T(M^+M-I)^TC^TC(M^+M-I)\,y \\ &= y^T\left(I-M^+M\right)C\left(I-M^+M\right)y \\ }$$
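A quick numerical sanity check of these closed forms (an illustrative sketch; the random data and the comparison against `np.linalg.lstsq` are my own, not part of the derivation). For full-column-rank $X$, $\,b=\big(X^+CX\big)^+X^+Cy$ and $\alpha = \tfrac 1n{\tt1}^T(y-Xb)$ agree with an ordinary least-squares fit on the design $[\,{\tt1}\;X\,]$:

```python
import numpy as np

# Illustrative data; names (X, y, C, b, alpha) follow the derivation above.
rng = np.random.default_rng(1)
n, p = 20, 3
X = rng.normal(size=(n, p))   # full column rank (almost surely)
y = rng.normal(size=n)

C = np.eye(n) - np.ones((n, n)) / n   # centering matrix I - J/n
Xp = np.linalg.pinv(X)                # Moore-Penrose inverse X^+

# Closed forms from the derivation
b = np.linalg.pinv(Xp @ C @ X) @ Xp @ C @ y
alpha = np.mean(y - X @ b)            # (1/n) 1^T (y - Xb)

# Reference: ordinary least squares on the augmented design [1 X]
A = np.hstack([np.ones((n, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(coef, np.r_[alpha, b]))  # True
```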
So that's how you would solve the problem if you treat $\alpha$ as a separate variable. But what most people do instead is use augmented variables by prepending ${\tt1}$ to each $x_k$ vector and prepending $\alpha$ to the $b$ vector.
Then the algebra becomes much simpler, i.e.
$$\eqalign{ X &= \big[\,\hat x_1\;\hat x_2\;\ldots\;\hat x_n\big]^T,\qquad \hat x_k = \pmatrix{{\tt1}\\x_k},\qquad b = \pmatrix{\alpha\\B} \\ w &= Xb - y \\ \rho &= w^Tw \\ d\rho &= 2(X^Tw)^Tdb \\ \frac{\p \rho}{\p b} &= 2X^T(Xb-y) = 0 \\ b &= X^+y \\ w &= (XX^+-I)y \\ \rho &= y^T(I-XX^+)^T(I-XX^+)y \\ &= y^T(I-XX^+)y \\ }$$
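The augmented formulation is equally easy to check numerically. This sketch (illustrative random data, my own naming) computes $b = X^+y$ with the augmented design and confirms that $\rho = y^T(I-XX^+)y$ matches the explicit residual sum of squares $w^Tw$:

```python
import numpy as np

# Illustrative data for the augmented formulation.
rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Augmented design: prepend 1 to each x_k, i.e. a column of ones
Xhat = np.hstack([np.ones((n, 1)), X])

# b = X^+ y; the first entry is alpha, the rest is B
b = np.linalg.pinv(Xhat) @ y

# rho = y^T (I - X X^+) y, using the projector P = X X^+
P = Xhat @ np.linalg.pinv(Xhat)
rho = y @ (np.eye(n) - P) @ y

# Compare against the explicit residual w = Xb - y
w = Xhat @ b - y
print(np.isclose(rho, w @ w))  # True
```

The simplification $y^T(I-XX^+)^T(I-XX^+)y = y^T(I-XX^+)y$ works because $I-XX^+$ is a symmetric idempotent projector, which is exactly what the check above exercises.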