I know that the gradient of $X \mapsto \mbox{Tr}(XA)$ is $A^T$. How does this change if $A$ and $X$ are swapped? Is the gradient of $X \mapsto \mbox{Tr}(AX)$ the same?
Also, how does this extend to products of more matrices? Can we just treat everything before $X$ as a single matrix $A$? For example, for $X \mapsto\mbox{Tr}\left(U^T V X\right)$, can we take $U^TV$ as our "$A$" matrix and proceed as above?
Theorem: ${\mathrm{d} f({X})= \text{trace}(M^T \mathrm{d} {X}) \iff \frac{\partial f}{\partial {X}} = M}$
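To see why this identification works, here is a short entrywise sketch, assuming $f$ is differentiable, the entries of $X$ are independent, and writing $M_{ij} = \frac{\partial f}{\partial X_{ij}}$:

$$\mathrm d f = \sum_{i,j} \frac{\partial f}{\partial X_{ij}} \, \mathrm d X_{ij} = \sum_{i,j} M_{ij} \, \mathrm d X_{ij} = \text{trace}(M^T \mathrm d X)$$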
In your case,
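$$\mathrm d \ \text{trace}(AXB) = \text{trace}(\mathrm d (AX B)) = \text{trace}(A \ \mathrm d X\ B) = \text{trace}(B A \ \mathrm d X)$$ where the last step uses the cyclic property of the trace, and thus we identify $(BA)^T = A^T B^T$ as the derivative. Taking $B = I$ shows that the gradient of $\text{trace}(AX)$ is $A^T$, the same as for $\text{trace}(XA)$. Replacing $A$ by $U^T V$ (again with $B = I$) gives $(U^T V)^T = V^T U$, so yes, you can treat $U^T V$ as your "$A$".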
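
If you want to sanity-check these gradients numerically, here is a minimal sketch using NumPy and central finite differences. The helper `numerical_gradient`, the random square test matrices, and the tolerance are my own assumptions for illustration, not part of the derivation above.

```python
import numpy as np


def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of the matrix of partials df/dX_ij.

    Hypothetical helper for illustration: perturbs one entry of X at a time.
    """
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad


# Random square test matrices (assumption: any compatible shapes would do).
rng = np.random.default_rng(0)
n = 4
A, B, U, V, X = (rng.standard_normal((n, n)) for _ in range(5))

# Finite-difference gradients of each trace function with respect to X.
grad_AXB = numerical_gradient(lambda Z: np.trace(A @ Z @ B), X)
grad_AX = numerical_gradient(lambda Z: np.trace(A @ Z), X)
grad_XA = numerical_gradient(lambda Z: np.trace(Z @ A), X)
grad_UVX = numerical_gradient(lambda Z: np.trace(U.T @ V @ Z), X)

# Compare against the closed forms from the trace differential.
print(np.allclose(grad_AXB, A.T @ B.T, atol=1e-6))  # Tr(AXB)     -> A^T B^T
print(np.allclose(grad_AX, A.T, atol=1e-6))         # Tr(AX)      -> A^T
print(np.allclose(grad_XA, A.T, atol=1e-6))         # Tr(XA)      -> A^T
print(np.allclose(grad_UVX, V.T @ U, atol=1e-6))    # Tr(U^T V X) -> V^T U
```

Since each of these functions is linear in $X$, the central differences agree with the closed forms up to rounding error, so each comparison should come out true.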