The definition of the matrix-by-matrix derivative is:
$$ \frac{\partial X_{kl}}{\partial X_{ij}}=\delta_{ik}\delta_{lj} $$
If the matrices are $n\times n$, then the resulting matrix will be $n^2 \times n^2$.
Is the following identity valid for the matrix-by-matrix derivative?
$$ \frac{\partial}{\partial A} AB = \frac{\partial A}{\partial A} B + A\frac{\partial B}{\partial A} $$
If so, I do not understand how we can multiply a $n^2 \times n^2$ matrix by a $n \times n$ matrix?
$$ \frac{\partial}{\partial A} AB = \underbrace{\frac{\partial A}{\partial A}}_{n^2\times n^2} \overbrace{B}^{n\times n} + \overbrace{A}^{n\times n} \underbrace{\frac{\partial B}{\partial A}}_{n^2 \times n^2} $$
As suggested in the comments, computing the gradient of a matrix with respect to a matrix will result in a fourth-order tensor.
The product rule holds if you consider differentials. For example:
$$ \begin{align} F &= AB \\ dF &= dAB + A dB \end{align} $$
Now you may not want to work with fourth-order tensors, thus you can look for a "flattened", matricial represetation of the tensor. Suppose you want to the compute this representation for $\frac{\partial F}{\partial A}$.
You can proceed by vectorizing both sides:
$$ \begin{align} \rm{vec}(dF) &= \rm{vec}(dAB)\\ &= \rm{vec}(I~dAB) \end{align} $$ where I is the identity matrix.
And using the Kronecker-vec relation:
$$ \begin{align} \rm{vec}(dF) &= (B^T \otimes I) \rm{vec}(dA) \\ df &= (B^T \otimes I) ~da \end{align} $$
Thus:
\begin{align} \frac{\partial f}{\partial a} = B^T \otimes I \end{align}
Which is $n^2 \times n^2$ instead of $n\times n \times n \times n$.
If you do want to work with fourth-order tensors, without the vectorization trick, then you can proceed as in this answer.