Let us simplify $(Ax-b)^T(Ax-b)$ step by step, where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^{n \times 1}$, and $b \in \mathbb{R}^{m \times 1}$:
$(Ax-b)^T(Ax-b)$
$= (x^TA^T - b^T)(Ax - b)$ (property of transpose)
$= x^TA^TAx - x^TA^Tb - b^TAx + b^Tb$ (distributive property)
Now what is the correct way to combine the inner terms
$-x^TA^Tb - b^TAx $?
or equivalently
$x^TA^Tb + b^TAx $?
I find myself constantly asking the following questions
- Is it legal or illegal to transpose either of the two terms? Why?
- Which term ($x^TA^Tb$ or $b^TAx $) should you take the transpose of? Why?
I would really appreciate it if someone could resolve this for me.
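Both $x^TA^Tb$ and $b^TAx$ are scalars: the first is a product of shapes $(1 \times n)(n \times m)(m \times 1)$ and the second of shapes $(1 \times m)(m \times n)(n \times 1)$, so each is $1 \times 1$. A scalar equals its own transpose, so you can transpose either of them.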
You can transpose $x^TA^Tb$ and get
$$x^TA^Tb + b^TAx = b^TAx + b^TAx = 2 b^TAx,$$
or you can transpose $b^TAx$ and get
$$x^TA^Tb + b^TAx = x^TA^Tb + x^TA^Tb = 2 x^TA^Tb.$$
The two results agree, because each is a scalar and therefore equal to its own transpose:
$$2 x^TA^Tb = 2 (x^TA^Tb)^T = 2 b^TAx.$$
So to summarize, you can transpose either one of them, and should probably transpose $x^TA^Tb$, since you reach the simpler result $2b^TAx$ faster (it can be considered simpler since there is no transpose of $A$ in it). With the original minus signs restored, the full expansion becomes $(Ax-b)^T(Ax-b) = x^TA^TAx - 2b^TAx + b^Tb$.
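If you want to convince yourself numerically, here is a minimal sketch using NumPy. The sizes `m = 5`, `n = 3`, the random seed, and the variable names are arbitrary choices made for illustration. It also stores $x$ and $b$ as 1-D arrays rather than explicit column vectors, so each product comes out as a plain scalar.

```python
import numpy as np

# Arbitrary illustrative sizes and seed (assumptions, not from the question).
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)   # 1-D array standing in for an n x 1 column
b = rng.standard_normal(m)   # 1-D array standing in for an m x 1 column

# The two cross terms: x^T A^T b and b^T A x.
cross_1 = x @ A.T @ b
cross_2 = b @ A @ x

# Left side: (Ax - b)^T (Ax - b); right side: x^T A^T A x - 2 b^T A x + b^T b.
r = A @ x - b
lhs = r @ r
rhs = x @ A.T @ A @ x - 2 * (b @ A @ x) + b @ b

print(np.isclose(cross_1, cross_2))  # cross terms agree up to rounding
print(np.isclose(lhs, rhs))          # expanded form matches the original
```

Both checks use `np.isclose` rather than `==` because the two sides are computed in a different order and can differ by floating-point rounding.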