Given a matrix $A \in \mathbb R^{m \times n}$ and a vector $y \in \mathbb R^m$, I want to take the gradient of the following scalar field with respect to $x\in \mathbb R^n$:
$$x \mapsto (Ax - y)^T(Ax - y).$$
$\textbf{Attempt}.$ \begin{align} \frac{\partial}{\partial x} \big((Ax - y)^T(Ax - y) \big) &= \frac{\partial}{\partial x} \big( (x^TA^TAx - x^TA^Ty - y^TAx+ y^Ty )\big)\\ &= \frac{\partial}{\partial x}x^TA^TAx - \frac{\partial}{\partial x}x^TA^Ty - \frac{\partial}{\partial x}y^TAx+ \frac{\partial}{\partial x}y^Ty \\ &= 2 A^TAx - A^Ty - y^TA\qquad\,\,\mathbf{(1*)}\\ &= 2 A^TAx - 2A^Ty. \qquad\qquad\,\mathbf{(2*)}\\ \end{align}
$\textbf{Question}.$ There are two expressions above marked by $(*)$. I don't understand the justification in going from $(1*)$ to $(2*)$ (in fact, the dimensions don't make sense...), which makes me think that there is a mistake in $(1*)$. Can someone explain the basics involved in these matrix manipulations?
I have written out some cases explicitly by hand, and here are the results, starting with the obvious:
$x$ is $n \times 1$, $y$ is $m \times 1$,
$x^T$ is $1 \times n$, $y^T$ is $1 \times m$,
$A$ is $m \times n$, $A^T$ is $n \times m$,
$A^TA$ is $n \times n$, $A^TAx$ is $n \times 1$, and $x^TA^TAx$ is a scalar.
$A^Ty$ is $n \times 1$, $x^TA^Ty$ is a scalar, $y^TA$ is $1 \times n$, $y^TAx$ is a scalar.
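The dimension bookkeeping above can be double-checked mechanically. The following is a minimal sketch using NumPy; the sizes `m = 4`, `n = 3` and the random entries are assumptions chosen only for illustration, with $m \neq n$ so that any row/column mix-up shows up in the shapes.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (m != n on purpose).
m, n = 4, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
x = rng.standard_normal((n, 1))  # column vector, n x 1
y = rng.standard_normal((m, 1))  # column vector, m x 1

# Shape of each product that appears in the expansion.
shapes = {
    "A^T A": (A.T @ A).shape,
    "A^T A x": (A.T @ A @ x).shape,
    "x^T A^T A x": (x.T @ A.T @ A @ x).shape,
    "A^T y": (A.T @ y).shape,
    "x^T A^T y": (x.T @ A.T @ y).shape,
    "y^T A": (y.T @ A).shape,
    "y^T A x": (y.T @ A @ x).shape,
}
for name, shape in shapes.items():
    print(f"{name:12s} -> {shape}")
```

Note that `A^T y` and `y^T A` come out with transposed shapes, so they cannot be added or subtracted as they stand.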
Taking the partial derivative of a scalar $s$ with respect to the vector $x$ means forming the column vector $$\begin{bmatrix}\frac{\partial s}{\partial x_1} \\ \frac{\partial s}{\partial x_2} \\ \vdots \\ \frac{\partial s}{\partial x_n}\end{bmatrix}.$$
Writing the scalars out explicitly, taking their partial derivatives, and recognizing the results, I find $$\frac{\partial}{\partial x}(x^TA^TAx) = 2A^TAx$$ and $$\frac{\partial}{\partial x}(x^TA^Ty) = A^Ty,$$ which surprised me, as well as $$\frac{\partial}{\partial x}(y^TAx) = (y^TA)^T = A^Ty.$$ So the pieces do match up, provided the third term in $(1*)$ is written as the column vector $(y^TA)^T$ rather than the row vector $y^TA$.
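These hand computations can also be checked numerically. Below is a rough sketch that compares each claimed gradient with a central finite-difference estimate; the helper `numerical_gradient`, the sizes, and the random test data are all assumptions made for illustration, not part of the original derivation.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of [df/dx_1, ..., df/dx_n] at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Hypothetical test data (sizes and seed are arbitrary).
m, n = 4, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

# Each entry pairs a scalar function of v with its claimed gradient at x.
checks = {
    "x^T A^T A x": (lambda v: v @ A.T @ A @ v, 2 * A.T @ A @ x),
    "x^T A^T y": (lambda v: v @ A.T @ y, A.T @ y),
    "y^T A x": (lambda v: y @ A @ v, A.T @ y),
    "(Ax-y)^T(Ax-y)": (
        lambda v: (A @ v - y) @ (A @ v - y),
        2 * A.T @ A @ x - 2 * A.T @ y,
    ),
}

results = {
    name: np.allclose(numerical_gradient(f, x), claimed, atol=1e-5)
    for name, (f, claimed) in checks.items()
}
for name, ok in results.items():
    print(f"{name:16s} matches finite differences: {ok}")
```

One-dimensional arrays are used here so that every product collapses to a plain scalar; the gradient is then an array of length $n$, which corresponds to the column vector defined above.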
Trying to find general rules, I am using the matrix calculus entry on Wikipedia. Writing out a case, I find that $$\frac{\partial}{\partial x}(Ax) = A^T,$$ while $$\frac{\partial}{\partial x}(x^TB) = B.$$
Applying those rules gives the last two results immediately, and also $$\frac{\partial}{\partial x}(x^TA^TAx) = A^TAx + (x^TA^TA)^T = A^TAx + A^TAx = 2A^TAx,$$ as expected.
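For reference, here is how the chain from the question reads once every term is kept as an $n \times 1$ column vector. This is only a sketch under the column-vector convention for $\partial/\partial x$ used above; the sole change from $(1*)$ is that the third term appears as $(y^TA)^T$ instead of $y^TA$.

\begin{align} \frac{\partial}{\partial x} \big((Ax - y)^T(Ax - y) \big) &= 2A^TAx - A^Ty - (y^TA)^T \\ &= 2A^TAx - A^Ty - A^Ty \\ &= 2A^TAx - 2A^Ty \\ &= 2A^T(Ax - y). \end{align}

With that transpose in place, every term has dimension $n \times 1$ and the step from $(1*)$ to $(2*)$ is just collecting two equal terms.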