I have the following variables/matrices:
$$A \in \mathbb{R}^{m \times n} , \quad p \in \mathbb{R}^{n}, \quad \Sigma \in \mathbb{R}^{m \times m}, \quad w \in \mathbb{R}^{m}$$
where $\Sigma$ is a diagonal matrix. With these we define the function $S(p)$ as $$S(p) = (w + Ap)^{T} \Sigma^{-1} (w + Ap)$$
Since we would like to find the minimum of $S(p)$, we compute the first derivative with respect to $p$. According to the reference solution I was given, this is $$\nabla S(p) = 2(Ap + w)^{T} \Sigma^{-1} A \overset{!}{=} 0$$
However, I don't understand how they arrive at this result. Could somebody please explain the intermediate steps?
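Let's look at the partial derivative with respect to the first coordinate of $p$, which we call $x$. Write $\hat x$ for the corresponding unit vector, so that $\frac{\partial}{\partial x}(w+Ap) = A\hat x$.
First we apply the product rule. Each of the two resulting terms is a scalar, and a scalar equals its own transpose, so we can transpose the first term. Because $\Sigma$ is diagonal, $\Sigma^{-1}$ is symmetric, so the transposed first term is identical to the second.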
$$\begin{aligned}
\frac{\partial}{\partial x} S(p) &= \frac{\partial}{\partial x}\left( (w+Ap)^T \Sigma^{-1} (w+Ap) \right) \\
&= \left(\frac{\partial}{\partial x}(w+Ap)\right)^T \Sigma^{-1} (w+Ap) + (w+Ap)^T \Sigma^{-1} \frac{\partial}{\partial x}(w+Ap) \\
&= \left(\left(\frac{\partial}{\partial x}(w+Ap)\right)^T \Sigma^{-1} (w+Ap)\right)^T+ (w+Ap)^T \Sigma^{-1} \frac{\partial}{\partial x}(w+Ap) \\
&= (w+Ap)^T \Sigma^{-1} \frac{\partial}{\partial x}(w+Ap)+ (w+Ap)^T \Sigma^{-1} \frac{\partial}{\partial x}(w+Ap) \\
&= 2 (w+Ap)^T \Sigma^{-1} \frac{\partial}{\partial x}(w+Ap) \\
&= 2 (w+Ap)^T \Sigma^{-1} (A\hat x)
\end{aligned}$$
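More generally, for the $i$-th coordinate of $p$, with $e_i$ the $i$-th standard basis vector, we can write this as: $$\nabla_i S(p) = 2 (w+Ap)^T \Sigma^{-1} (A e_i)$$ Collecting all $n$ components into a row vector, the $e_i$ become the columns of the identity matrix $I$: $$\nabla S(p) = 2 (w+Ap)^T \Sigma^{-1} (A I) = 2 (w+Ap)^T \Sigma^{-1} A$$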
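
If you want to convince yourself that the formula is right, a quick numerical check can help. The following is only a sketch in plain Python: the helper names (`residual`, `S`, `grad_S`, `numerical_grad`), the random test data, and the choice to store the diagonal of $\Sigma$ as a list `sigma_diag` are my own assumptions, not part of the question. It compares the closed-form gradient $2(w+Ap)^T \Sigma^{-1} A$ with central finite differences of $S$.

```python
import random

def residual(A, p, w):
    # r = w + A p, with A stored as a list of m rows of length n
    return [w_j + sum(a * x for a, x in zip(row, p)) for row, w_j in zip(A, w)]

def S(p, A, w, sigma_diag):
    # S(p) = r^T Sigma^{-1} r; Sigma is diagonal, so this is sum_j r_j^2 / sigma_j
    r = residual(A, p, w)
    return sum(r_j * r_j / s_j for r_j, s_j in zip(r, sigma_diag))

def grad_S(p, A, w, sigma_diag):
    # Closed form: component i is 2 * sum_j (r_j / sigma_j) * A[j][i]
    r = residual(A, p, w)
    return [
        2.0 * sum(r[j] / sigma_diag[j] * A[j][i] for j in range(len(r)))
        for i in range(len(p))
    ]

def numerical_grad(f, p, h=1e-6):
    # Central differences, one coordinate of p at a time
    grad = []
    for i in range(len(p)):
        plus = list(p)
        minus = list(p)
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2.0 * h))
    return grad

# Arbitrary test data (assumed for illustration only)
random.seed(0)
m, n = 5, 3
A = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
w = [random.gauss(0.0, 1.0) for _ in range(m)]
p = [random.gauss(0.0, 1.0) for _ in range(n)]
sigma_diag = [random.uniform(0.5, 2.0) for _ in range(m)]  # positive diagonal entries

analytic = grad_S(p, A, w, sigma_diag)
numeric = numerical_grad(lambda q: S(q, A, w, sigma_diag), p)
max_abs_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print("largest difference between the two gradients:", max_abs_diff)
```

Since $S$ is quadratic in $p$, central differences are exact up to rounding, so the two gradients should agree to many digits.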