The following is part of an assignment so no full answers please.
I am given the following VARX process:
$$ Y(t) = \mu + \sum_{p=1}^P A_p Y(t-p) + BU(t) + \epsilon (t), $$
with $t=1, \dots ,T$. Moreover: $Y(t) \in \mathbb{R}^n, \mu \in \mathbb{R}^n, A_p \in \mathbb{R}^{n\times n}, B \in \mathbb{R}^{n\times d}, U(t) \in \mathbb{R}^d$ and $\epsilon (t)$ is $N(0_n, \sigma I_{n\times n})$ and describes the noise. The goal is to estimate the parameters $\mu, A_p,B$ with least squares. I have some difficulties with the derivation of the matrix $A$:
$$ \frac{\partial}{\partial A_p} S = \sum_{t=1}^T \left[ \left( Y(t)- \mu - \sum_p A_p Y(t-p) + BU(t) \right) \sum_p \frac{\partial}{\partial A_p} \left( A_p Y(t-p) \right) \right] = 0 .$$
Regarding the partial derivative in this equation, is it correct to do the following:
$$ \sum_p \left[ \frac{\partial}{\partial A_p} \left( \sum_{i=1}^n \sum_{j=1}^n (a_p)_{i,j} y_j(t-p) \right) \right] = \sum_p \left( \sum_{i=1}^n \sum_{j=1}^n y_j(t-p) \right) ?$$
There is a lot of extraneous detail in your problem statement, so let's start with some cleanup.
First, we'll reserve uppercase letters for matrices, and use lowercase letters for vectors, and greek letters for scalars.
Next, define vectors which omit the dependence on the time-step which is irrelevant to finding the derivative wrt the $A_p$ parameter. $$\eqalign{ y &= y(t) \cr w_p &= y(t-p) \cr c &= c(t) = \epsilon(t) + Bu(t) + \mu \cr z &= c+\sum_p A_pw_p - y\cr }$$ Now write the norm using the inner/Frobenius product (represented by a colon) $$\eqalign{ \phi &= \|z\|_F^2 = z:z \cr d\phi &= 2z:dz \cr &= 2z:\sum_p dA_pw_p \cr &= \sum_p\Big(2z: dA_pw_p\Big) \cr }$$ I'll let you figure out how to obtain $\frac{\partial\phi}{\partial A_p}$ from there.
Note that the Frobenius product is merely a convenient infix notation for the trace, i.e. $$A:B={\rm tr}(A^TB)$$