Hello, I was recently working through https://www.deeplearningbook.org/ and, in a section on Principal Component Analysis, the author writes the following:
Suppose $x=\{x^{(1)},x^{(2)},\dots,x^{(m)}\}$ with each $x^{(i)} \in \mathbb{R}^{n}$. For each $x^{(i)} \in \mathbb{R}^{n}$ we want a corresponding code vector $c \in \mathbb{R}^{l}$ with $l \le n$. We want an encoder $f(x)=c$ and a decoder $g$ so that $x$ approximates $g(f(x))$. We also constrain the columns of $D$ to be orthogonal.
For $D \in \mathbb{R}^{n \times l}$ and $c \in \mathbb{R}^{l}$, let $g(c)=Dc$ define the decoding.
$$\nabla_{c}(-2x^{T}Dc+c^{T}c)=0$$
implies $c=D^{T}x$
My question is: when taking the gradient, how does the term $-2x^{T}Dc$ give rise to $D^{T}x$ rather than $x^{T}D$?
Notice that $x$ is not a matrix; $x$ is a set of vectors, and so $D^{T}x$ and $x^{T}D$ are the same set of vectors, since $(x^{T}D)^{T} = D^{T}x$. Pedantically, in one case you have a set of column vectors and in the other a set of row vectors. However, being a row or column vector is only relevant when you are thinking of the set as a matrix, and the usual convention is that a gradient with respect to $c \in \mathbb{R}^{l}$ is written as a column vector, which is why the book writes $D^{T}x$. Vectors qua vectors don't care about row-ness or column-ness.
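To spell out the step you ask about: since $x^{T}Dc$ is a scalar, it can be rewritten as $(D^{T}x)^{T}c$, and with the column-vector conventions $\nabla_{c}(a^{T}c)=a$ and $\nabla_{c}(c^{T}c)=2c$, a sketch of the standard derivation (not quoted from the book) is

$$\nabla_{c}\left(-2x^{T}Dc+c^{T}c\right)=-2D^{T}x+2c=0 \quad\Longrightarrow\quad c=D^{T}x.$$

The transpose appears only because the gradient is, by convention, a column vector in $\mathbb{R}^{l}$; $x^{T}D$ is the same object written as a row vector.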
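If it helps, here is a quick numerical sanity check. This is a sketch in NumPy; the variable names and the use of a QR factorization to obtain orthonormal columns are my own choices, not the book's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 5, 3

# D with orthonormal columns (D.T @ D == I_l), obtained via reduced QR
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
x = rng.standard_normal(n)

# D^T x and x^T D hold the same numbers; NumPy 1-D arrays don't even
# distinguish row from column, which is exactly the point above
assert np.allclose(D.T @ x, x @ D)

# c* = D^T x makes the analytic gradient -2 D^T x + 2 c vanish...
c_star = D.T @ x
assert np.allclose(-2 * D.T @ x + 2 * c_star, 0)

# ...and any perturbation of c* only increases the reconstruction error
def loss(c):
    r = x - D @ c
    return r @ r

for _ in range(100):
    assert loss(c_star) <= loss(c_star + 0.1 * rng.standard_normal(l)) + 1e-12
```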