My question pertains to someone else's answered question, which made me curious. The OP wanted to differentiate the following using the chain rule: $$ J(\theta)=\frac12(X\theta-y)^T(X\theta - y) $$ The accepted answer used the Frobenius product with $m\times p$ and $n\times p$ matrices. However, I'm still confused about why the OP's original approach failed. Specifically, if $\theta$ is an $n\times 1$ vector and $y$ is an $m\times 1$ vector, then: $$ u(\theta)=X\theta-y\\~\\ J(\theta)=\frac12u^Tu\\~\\ \frac{\partial J}{\partial \theta}=\frac{\partial J}{\partial u}\frac{\partial u}{\partial \theta}\\~\\ =\frac12(2u)(X)\\~\\ =(X\theta-y)(X)\\~\\ =X\theta X-yX\\~\\ \not=X^TX\theta-X^Ty\\ $$ I'm clearly missing something when $\theta$ and $y$ remain vectors. Does anyone see my mistake? Thanks.
Edit 1:
There are many proofs online that are perfectly fine; I'd just like to know how to correctly apply the chain rule when $\theta$ and $y$ are vectors.
Edit 2:
Being new to vector calculus, I've broken the problem down into summations to shed some light on what's happening: $$ u(\theta)= \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1n}\\ x_{21} & x_{22} & \ldots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{m1} & x_{m2} & \ldots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_1\\ \theta_2\\ \vdots\\ \theta_n \end{bmatrix} - \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}= \begin{bmatrix} \theta_1x_{11} + \theta_2x_{12} + \ldots + \theta_nx_{1n} - y_1\\ \theta_1x_{21} + \theta_2x_{22} + \ldots + \theta_nx_{2n} - y_2\\ \vdots\\ \theta_1x_{m1} + \theta_2x_{m2} + \ldots + \theta_nx_{mn} - y_m \end{bmatrix}\\~\\ u_i=\left(\sum_{j=1}^n\theta_jx_{ij}\right)-y_i\\~\\ J(\theta)=\frac12u^Tu=\frac12\sum_{i=1}^mu_i^2\\~\\ =\frac12\sum_{i=1}^m\left[\left(\sum_{j=1}^n\theta_jx_{ij}\right)-y_i\right]^2\\ $$ If we focus on only one partial derivative: $$ \frac{\partial J}{\partial \theta_1}=\frac12(2)\sum_{i=1}^m x_{i1}\left[\left(\sum_{j=1}^n\theta_jx_{ij}\right)-y_i\right]\\~\\ =\sum_{i=1}^m x_{i1}\left[\left(\sum_{j=1}^n\theta_jx_{ij}\right)-y_i\right]\\ $$ This is equivalent to: $$ \frac{\partial J}{\partial \theta_1}= \begin{bmatrix}x_{11} & x_{21} & \ldots & x_{m1}\end{bmatrix} \left(\begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1n}\\ x_{21} & x_{22} & \ldots & x_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ x_{m1} & x_{m2} & \ldots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_1\\ \theta_2\\ \vdots\\ \theta_n \end{bmatrix} - \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}\right)\\~\\~\\ =x_1^T(X\theta - y) $$ And similarly, for all the partial derivatives: $$ \frac{\partial J}{\partial \theta}=X^T(X\theta - y)\\~\\ =X^TX\theta-X^Ty\\ $$ The interesting part is that when I perform the differentiation this way, the $\frac{\partial u}{\partial \theta}$ term correctly ends up on the left, and is transposed. 
However, when I tried to use the chain rule directly, it incorrectly ends up on the right, not transposed: $$ \frac{\partial J}{\partial \theta}=X^T(X\theta - y)\\~\\ \frac{\partial J}{\partial \theta}\not=(X\theta-y)X\\ $$ Does anyone see where I am going wrong with the chain rule? Any insight is much appreciated!
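Edit 3:
For completeness, here is a quick numerical sanity check (numpy, with arbitrary small random $X$, $y$, $\theta$ — the values are illustrative, not from the original problem) comparing $X^T(X\theta-y)$ against a central finite-difference approximation of $\nabla J$:

```python
import numpy as np

# Illustrative sizes and data (assumed, not from the original problem).
rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
theta = rng.standard_normal(n)

def J(t):
    u = X @ t - y
    return 0.5 * u @ u

# The claimed gradient: X^T (X theta - y).
grad = X.T @ (X @ theta - y)

# Central finite differences, one component of theta at a time.
eps = 1e-6
fd = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad, fd, atol=1e-6))  # True
```

(Note that the wrong candidate $(X\theta-y)X$ can't even be formed here: it tries to multiply an $m\times 1$ vector by an $m\times n$ matrix.)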
The error is that what you are calling $\frac{\partial J}{\partial u}$ is equal to $u^T$, not $u$.
With that correction, here's how the calculation goes. Let $g(u) = \frac12 u^T u = \frac12 \| u \|^2$ and let $h(\theta) = X \theta - y$. Note that $g'(u) = u^T$ and $h'(\theta) = X$. The function $J$ is the composition of $g$ and $h$: $$ J(\theta) = g(h(\theta)). $$ By the chain rule, \begin{align} J'(\theta) &= g'(h(\theta)) h'(\theta) \\ &= (X \theta - y)^T X. \end{align} The gradient of $J$ is $$ \nabla J(\theta) = J'(\theta)^T = X^T (X \theta - y). $$
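The shapes are what make this work: $g'(u) = u^T$ is $1 \times m$ and $h'(\theta) = X$ is $m \times n$, so the product is a $1 \times n$ row vector whose transpose is the gradient. A short numpy illustration (small random data, assumed purely for demonstration):

```python
import numpy as np

# Illustrative data (assumed, not from the question).
rng = np.random.default_rng(0)
m, n = 4, 2
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
theta = rng.standard_normal(n)

u = X @ theta - y                 # h(theta), an m-vector

# Chain rule: J'(theta) = g'(h(theta)) h'(theta) = u^T X, a 1 x n row vector.
J_prime = u.reshape(1, m) @ X     # shape (1, n)

# The gradient is its transpose: the n-vector X^T u = X^T (X theta - y).
grad = X.T @ u

print(J_prime.shape)                       # (1, 2)
print(np.allclose(J_prime.ravel(), grad))  # True
```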
Background knowledge: If $F: \mathbb R^n \to \mathbb R^m$ is differentiable at a point $x \in \mathbb R^n$, then $F'(x)$ is an $m \times n$ matrix which satisfies $$ F(x + \Delta x) \approx F(x) + F'(x) \Delta x. $$ The approximation is good when $\Delta x \in \mathbb R^n$ is small. This local linear approximation is the key idea of calculus.
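This approximation can be seen directly on $J$ itself (random data below is illustrative only). Since $J$ is quadratic, the error of the linear approximation is exactly $\frac12 \Delta x^T X^T X \Delta x$, so shrinking $\Delta x$ by a factor of 10 shrinks the error by a factor of about 100:

```python
import numpy as np

# Illustrative data (assumed, not from the question).
rng = np.random.default_rng(0)
m, n = 6, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
theta = rng.standard_normal(n)

def J(t):
    u = X @ t - y
    return 0.5 * u @ u

J_prime = (X @ theta - y) @ X     # J'(theta) = (X theta - y)^T X, as a flat array

# Error of J(theta + dx) ≈ J(theta) + J'(theta) dx for shrinking dx.
d = rng.standard_normal(n)
errors = []
for scale in (1e-1, 1e-2, 1e-3):
    dx = scale * d
    errors.append(abs(J(theta + dx) - (J(theta) + J_prime @ dx)))

print([f"{e:.1e}" for e in errors])  # each error ~100x smaller than the last
```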
If $F:\mathbb R^n \to \mathbb R$, then $F'(x)$ is a $1 \times n$ matrix (a row vector). The gradient of $F$ at $x$ is the column vector $$ \nabla F(x) = F'(x)^T. $$