Least Squares: Derivation of Normal Equations with Chain Rule (Revisited)


My question pertains to someone else's answered question that has made me curious. The OP wanted to differentiate the following using the chain rule: $$ J(\theta)=\frac12(X\theta-y)^T(X\theta - y) $$ The accepted answer used the Frobenius product with $m\times p$ and $n\times p$ matrices. However, I'm still a bit confused as to why the OP's original approach failed. Specifically, if $\theta$ is an $n\times 1$ vector and $y$ is an $m\times 1$ vector, then: $$ u(\theta)=X\theta-y\\~\\ J(\theta)=\frac12u^Tu\\~\\ \frac{\partial J}{\partial \theta}=\frac{\partial J}{\partial u}\frac{\partial u}{\partial \theta}\\~\\ =\frac12(2u)(X)\\~\\ =(X\theta-y)(X)\\~\\ =X\theta X-yX\\~\\ \not=X^TX\theta-X^Ty\\ $$ I'm obviously missing something when $\theta$ and $y$ remain vectors. Anyone see my mistake? Thanks.


Edit 1:

There are many proofs online that are perfectly fine; I'd just like to know how to correctly apply the chain rule when $\theta$ and $y$ are vectors.
Edit 2:

Being new to vector calculus, I've broken the problem down into summations to shed some light on what's happening: $$ u(\theta)= \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1n}\\ x_{21} & x_{22} & \ldots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{m1} & x_{m2} & \ldots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_1\\ \theta_2\\ \vdots\\ \theta_n \end{bmatrix} - \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}= \begin{bmatrix} \theta_1x_{11} + \theta_2x_{12} + \ldots + \theta_nx_{1n} - y_1\\ \theta_1x_{21} + \theta_2x_{22} + \ldots + \theta_nx_{2n} - y_2\\ \vdots\\ \theta_1x_{m1} + \theta_2x_{m2} + \ldots + \theta_nx_{mn} - y_m \end{bmatrix}\\~\\ u_i=\left(\sum_{j=1}^n\theta_jx_{ij}\right)-y_i\\~\\ J(\theta)=\frac12u^Tu=\frac12\sum_{i=1}^mu_i^2\\~\\ =\frac12\sum_{i=1}^m\left[\left(\sum_{j=1}^n\theta_jx_{ij}\right)-y_i\right]^2\\ $$ If we focus on only one partial derivative: $$ \frac{\partial J}{\partial \theta_1}=\frac12(2)\sum_{i=1}^mx_{i1}\left[\left(\sum_{j=1}^n\theta_jx_{ij}\right)-y_i\right]\\~\\ =\sum_{i=1}^mx_{i1}\left[\left(\sum_{j=1}^n\theta_jx_{ij}\right)-y_i\right]\\ $$ This is equivalent to: $$ \frac{\partial J}{\partial \theta_1}= \begin{bmatrix}x_{11} & x_{21} & \ldots & x_{m1}\end{bmatrix} \left(\begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1n}\\ x_{21} & x_{22} & \ldots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{m1} & x_{m2} & \ldots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_1\\ \theta_2\\ \vdots\\ \theta_n \end{bmatrix} - \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}\right)\\~\\~\\ =x_1^T(X\theta - y) $$ And similarly, for all the partial derivatives: $$ \frac{\partial J}{\partial \theta}=X^T(X\theta - y)\\~\\ =X^TX\theta-X^Ty\\ $$ The interesting part is that when I perform the differentiation this way, the $\frac{\partial u}{\partial \theta}$ term correctly ends up on the left, and is transposed.
However, when I tried to use the chain rule directly, it incorrectly ends up on the right, not transposed: $$ \frac{\partial J}{\partial \theta}=X^T(X\theta - y)\\~\\ \frac{\partial J}{\partial \theta}\not=(X\theta-y)X\\ $$ Does anyone see where I am going wrong with the chain rule? Any insight is much appreciated!
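The summation derivation above can also be checked numerically. Here is a small NumPy sketch (the sizes $m$, $n$, the random seed, and all variable names are illustrative assumptions, not part of the original question) that verifies $X^T(X\theta - y)$ against finite-difference partials, and shows that the mis-ordered product $(X\theta - y)X$ is not even a defined matrix product:

```python
import numpy as np

# A small random least-squares instance; sizes are illustrative assumptions.
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal((m, 1))      # y is m x 1
theta = rng.standard_normal((n, 1))  # theta is n x 1

def J(t):
    u = X @ t - y                    # u = X theta - y, an m x 1 vector
    return 0.5 * float(u.T @ u)      # J = (1/2) u^T u, a scalar

# Gradient from the summation derivation: an n x 1 vector.
grad = X.T @ (X @ theta - y)

# Check each partial derivative with central finite differences.
eps = 1e-6
for k in range(n):
    e = np.zeros((n, 1)); e[k] = eps
    fd = (J(theta + e) - J(theta - e)) / (2 * eps)
    assert np.isclose(grad[k, 0], fd, atol=1e-5)

# The mis-ordered product (X theta - y) X is an (m x 1)(m x n) product,
# which is not defined: the inner dimensions do not match.
try:
    (X @ theta - y) @ X
except ValueError:
    print("shape mismatch")
```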


Best Answer

The error is that what you are calling $\frac{\partial J}{\partial u}$ is equal to $u^T$, not $u$.


With that correction, here's how the calculation goes. Let $g(u) = \frac12 u^T u = \frac12 \| u \|^2$ and let $h(\theta) = X \theta - y$. Note that $g'(u) = u^T$ and $h'(\theta) = X$. The function $J$ is the composition of $g$ and $h$: $$ J(\theta) = g(h(\theta)). $$ By the chain rule, \begin{align} J'(\theta) &= g'(h(\theta)) h'(\theta) \\ &= (X \theta - y)^T X. \end{align} The gradient of $J$ is $$ \nabla J(\theta) = J'(\theta)^T = X^T (X \theta - y). $$
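As a numerical sanity check of these chain-rule factors (not part of the original answer; the data sizes and seed below are illustrative assumptions), one can compute $J'(\theta) = (X\theta - y)^T X$ as a $1 \times n$ row vector, transpose it to get the gradient, and confirm the gradient vanishes at the normal-equations solution $\theta^* = (X^TX)^{-1}X^Ty$:

```python
import numpy as np

# Small random instance; sizes m, n and the seed are illustrative.
rng = np.random.default_rng(1)
m, n = 6, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal((m, 1))
theta = rng.standard_normal((n, 1))

# Chain rule: J'(theta) = g'(h(theta)) h'(theta) = (X theta - y)^T X, a 1 x n row.
J_prime = (X @ theta - y).T @ X

# The gradient is the transpose of the derivative: an n x 1 column.
grad_J = J_prime.T

# At the normal-equations solution theta* = (X^T X)^{-1} X^T y the gradient vanishes.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(X.T @ (X @ theta_star - y), 0))  # prints True
```

Note that the row-vector form $(X\theta - y)^T X$ and the column-vector form $X^T(X\theta - y)$ carry the same information; the transpose only decides which convention (derivative vs. gradient) you are writing down.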


Background knowledge: If $F: \mathbb R^n \to \mathbb R^m$ is differentiable at a point $x \in \mathbb R^n$, then $F'(x)$ is an $m \times n$ matrix which satisfies $$ F(x + \Delta x) \approx F(x) + F'(x) \Delta x. $$ The approximation is good when $\Delta x \in \mathbb R^n$ is small. This local linear approximation is the key idea of calculus.
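The local linear approximation can be illustrated numerically for this very $J$ (a sketch with assumed sizes and seed; since $J$ is quadratic, the approximation error shrinks like $\|\Delta x\|^2$):

```python
import numpy as np

# Demonstrating F(x + dx) ≈ F(x) + F'(x) dx for F = J, where J'(theta) = (X theta - y)^T X.
rng = np.random.default_rng(2)
m, n = 6, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal((m, 1))
theta = rng.standard_normal((n, 1))

def J(t):
    u = X @ t - y
    return 0.5 * float(u.T @ u)

J_prime = (X @ theta - y).T @ X   # 1 x n derivative matrix

d = rng.standard_normal((n, 1))   # a fixed direction
for scale in (1e-1, 1e-2, 1e-3):
    dx = scale * d
    exact = J(theta + dx)
    approx = J(theta) + float(J_prime @ dx)
    # The gap shrinks roughly 100x each time scale shrinks 10x (quadratic error).
    print(scale, abs(exact - approx))
```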

If $F:\mathbb R^n \to \mathbb R$, then $F'(x)$ is a $1 \times n$ matrix (a row vector). The gradient of $F$ at $x$ is the column vector $$ \nabla F(x) = F'(x)^T. $$