I'd like to know $\frac{\partial f(\mathbf{U})}{\partial \mathbf{U}}$, i.e., the 'by-matrix derivative' of the following scalar function $f(\mathbf{U})$ w.r.t. $\mathbf{U}$.
$$f(\mathbf{U}) = \vec{x}^T \mathbf{U} \mathbf{D} \mathbf{U}^T \vec{x}\;,$$
where $\vec{x} \in \mathbb{R}^n$ is a column vector, $\mathbf{U} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix ($\mathbf{U}^T\mathbf{U} = \mathbf{I}_n$), and $\mathbf{D} \in \{0,1\}^{n \times n}$ is a diagonal matrix with $\mathbf{D} \neq \mathbf{I}_n$.
I found in The Matrix Cookbook, see eq. (82), the derivative $\frac{\partial g(\mathbf{U})}{\partial \mathbf{U}}$ of
$$g(\mathbf{U}) = \vec{x}^T \mathbf{U}^T \mathbf{D} \mathbf{U} \vec{x}\;.$$
Please note the difference in the placement of the transpose on $\mathbf{U}$ between $f(\mathbf{U})$ and $g(\mathbf{U})$.
From the earlier question "Derivative of inverse quadratic function of a matrix" I learned that $\frac{\partial f(\mathbf{U})}{\partial u_{ij}} = \vec{x}^T (\mathbf{U} \mathbf{D} \mathbf{J}^{ij} + \mathbf{J}^{ji} \mathbf{D} \mathbf{U}^T) \vec{x}$. Unfortunately, I can't figure out how to combine these entrywise derivatives into a closed-form matrix expression. The furthest I get is $\frac{\partial f(\mathbf{U})}{\partial u_{ij}} = \mathbf{D}\mathbf{U}^T\vec{x}\vec{x}^T \vert_{ij} + \vec{x}\vec{x}^T\mathbf{U}\mathbf{D}\vert_{ji}$.
Any help is appreciated!
A straightforward way is to expand $f(U+H) = x^T (U+H) D (U+H)^T x = f(U)+x^T HDU^Tx + x^TUDH^T x + f(H)$, and note that $|f(H)| \le K \|H\|^2$ for some constant $K$, so the remainder $f(H)$ is $o(\|H\|)$.
It follows that the derivative is given by $Df(U)(H) = x^T HDU^Tx + x^TUDH^T x$. Since $f$ is real valued and $D^T=D$, we can write $Df(U)(H) = 2x^TUDH^T x$.
We have ${ \partial f(U) \over \partial U_{ij} } = Df(U)(E_{ij}) = 2x^TUD E_{ji} x = 2x^TUD e_j e_i^T x$.
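As a quick numerical sanity check (my addition, not part of the original derivation): assembling the entries $2x^TUDe_je_i^Tx$ into a matrix gives $2xx^TUD$, which can be compared against finite differences of $f$. The dimension and the particular $D$ below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Random orthogonal U via QR, diagonal 0/1 matrix D with D != I, random x.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
D = np.diag([1.0, 0.0, 1.0, 0.0])
x = rng.standard_normal(n)

def f(U):
    """f(U) = x^T U D U^T x."""
    return x @ U @ D @ U.T @ x

# Entrywise derivatives assembled into a matrix: grad f(U) = 2 x x^T U D.
grad = 2.0 * np.outer(x, x) @ U @ D

# Central finite differences of each entry d f / d U_ij.
eps = 1e-6
fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        fd[i, j] = (f(U + E) - f(U - E)) / (2 * eps)

assert np.allclose(fd, grad, atol=1e-6)
```

Since $f$ is quadratic in the entries of $U$, the central difference is exact up to floating-point error, so the agreement is very tight.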
Comments:
This is the definition (or one of a few equivalents) of differentiability:
A function $f:V \to W$ where $V,W$ are Banach spaces is said to be (Fréchet) differentiable at $x$ iff there exists a continuous linear operator $A:V \to W$ such that for all $\epsilon>0$, there exists some $\delta >0$ such that if $\|h\| <\delta$, then $\|f(x+h)-f(x) - A(h) \| \le \epsilon \|h\|$. The operator $A$ is called the derivative of $f$ at $x$.
A few points:
(1) In our case, $V=\mathbb{R}^{n \times n}$, $W = \mathbb{R}$.
(2) The derivative operator is often denoted $Df(x)$. Note that $Df(x):V \to W$. So, given $h \in V$, we write $Df(x)(h) \in W$ to denote the operator applied to $h$ (perhaps think of $h$ as a perturbation).
(3) The idea of differentiability is to approximate the difference $f(x+h)-f(x)$ by a linear map in $h$. Some folks write $f(x+h) =f(x)+A(h) + o(\|h\|)$.
(4) The linear operator $A$ cannot always be expressed as a matrix multiplication. For example, take the trace $\operatorname{tr}: \mathbb{R}^{n \times n} \to \mathbb{R}$. This is a differentiable function, but you cannot write down a single matrix multiplication that represents the derivative (in fact, we have $D \operatorname{tr}(x)(h) = \operatorname{tr}(h)$). This is a confusing point for many folks as we typically (in the $\mathbb{R}^n \to \mathbb{R}^m$ case) write the derivative as a matrix multiplication. The derivative of the function $f$ above cannot be written as a simple matrix multiplication.
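To make point (4) concrete, here is a small numerical illustration (my addition, not part of the original text) that the derivative of the trace is the linear operator $H \mapsto \operatorname{tr}(H)$, the same at every base point $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
X = rng.standard_normal((n, n))
H = rng.standard_normal((n, n))

# Directional finite difference of tr at X in direction H.
eps = 1e-6
fd = (np.trace(X + eps * H) - np.trace(X - eps * H)) / (2 * eps)

# D tr(X)(H) = tr(H): the derivative operator does not depend on X.
assert np.isclose(fd, np.trace(H))
```

Because the trace is itself linear, the finite difference matches $\operatorname{tr}(H)$ exactly up to floating-point error, for any choice of $X$.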
To answer the questions in your comment below:
To compute the derivative of $f$, we compute $f(U+H)-f(U)$ and look for linear and higher-order terms. We have $f(U+H)-f(U) = x^T HDU^Tx + x^TUDH^T x + f(H)$; the term $H \mapsto x^T HDU^Tx + x^TUDH^T x$ is linear (and continuous) in $H$, and the term $f(H)$ can be bounded by $K\|H\|^2$. Hence, from the definition, $f$ is differentiable at $U$, and the derivative applied to the direction $H$ is given by $Df(U)(H) = x^T HDU^Tx + x^TUDH^T x$. The derivative is a function $Df(U): \mathbb{R}^{n\times n} \to \mathbb{R}$, but it cannot be written simply as the product of some fixed matrix and $H$.
The expression $Df(U)(H) = x^T HDU^Tx + x^TUDH^T x$ completely defines the derivative of $f$.
Now for a slight backtrack: While what I wrote above is correct, there is a sense in which you can write down a single object that represents the derivative.
From above, we can write (using properties of the trace operator) $Df(U)(H) = 2x^T HDU^Tx = \operatorname{tr} (2x^T HDU^Tx) = \operatorname{tr}( 2 DU^Txx^T H ) = \operatorname{tr}( (2 xx^T U D )^T H )$.
If one uses the Frobenius norm and the corresponding inner product, we see that we can write $Df(U)(H) = \langle 2 xx^T U D, H \rangle $, so we can write the gradient $\nabla f(U) = 2 xx^T U D$.
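As a numerical check of this last identity (again my addition; the dimension and the particular $D$ are arbitrary), the two-term expression for $Df(U)(H)$ agrees with the Frobenius inner product $\langle 2 xx^T U D, H \rangle$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal U
D = np.diag([1.0, 1.0, 0.0, 0.0])                 # diagonal 0/1, D != I
x = rng.standard_normal(n)
H = rng.standard_normal((n, n))                   # arbitrary direction

# Derivative applied to direction H, in its two-term form.
Df_H = x @ H @ D @ U.T @ x + x @ U @ D @ H.T @ x

# Gradient via the trace / Frobenius inner product: <2 x x^T U D, H>.
G = 2.0 * np.outer(x, x) @ U @ D
assert np.isclose(Df_H, np.sum(G * H))
```

Here `np.sum(G * H)` is the Frobenius inner product $\operatorname{tr}(G^T H)$, which is why the gradient $\nabla f(U) = 2xx^TUD$ reproduces $Df(U)(H)$ for every direction $H$.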
However, you must realise that this is not just a simple matrix multiplication, and that the trace is intimately involved.