Derivatives of Matrix Functions of Different Dimensions


Notation:

  • For matrices $A,B\in\mathbb R^{n\times m}$, we define the inner product $\langle A,B\rangle=\sum_{i,j}A_{ij}B_{ij}$

  • The standard basis vector $e_i$ has a $1$ in position $i$ and $0$s elsewhere.


Problem:

Consider the function $f:\mathbb R^{n\times n}\times \mathbb R^{d\times n}\to \mathbb R$ such that $$f(X,Y)=\langle W, X-Y^\intercal Y\rangle.$$ My goal is to determine when the derivatives of $f$ with respect to $X$ and $Y$ are equal to $0$. From what I understand, we can take $\nabla_Xf=W$. However, for $\nabla_Yf$, we should consider the matrix $M$ with entries $$M_{ij}=\frac{\partial f}{\partial Y_{ij}}=-e_i^\intercal Y(W+W^\intercal)e_j$$ (here $e_i\in\mathbb R^d$, while $e_j\in\mathbb R^n$), meaning $M=-Y(W+W^\intercal)$. The problem is that $M$ is of size $d\times n$ while $\nabla_Xf$ is $n\times n$, so I'm not sure how to describe when the derivative of $f$ is equal to some matrix $Z$.


Attempt:

I thought about trying to represent the problem in a way where everything is the same shape. I noticed \begin{align} \langle W,Y^\intercal Y\rangle&=\text{Trace}(W^\intercal Y^\intercal Y)\\ &=\text{Trace}(YW^\intercal Y^\intercal)\\ &=\langle WY^\intercal, Y^\intercal\rangle\\ &=\frac12\left\langle\begin{bmatrix} 0&Y^\intercal\\ Y&0 \end{bmatrix}, \begin{bmatrix} 0&WY^\intercal\\ YW^\intercal&0 \end{bmatrix} \right\rangle \end{align} (the factor $\frac12$ is needed because the block inner product counts the term $\langle Y^\intercal,WY^\intercal\rangle=\langle Y,YW^\intercal\rangle$ twice), meaning we can rewrite $f$ as $$ f(X, Y)=\left\langle\begin{bmatrix} X&Y^\intercal\\ Y&0 \end{bmatrix}, \begin{bmatrix} W&-WY^\intercal/2\\ -YW^\intercal/2&0 \end{bmatrix} \right\rangle $$ so I'm thinking we can represent $\nabla_{X,Y}f$ as $$\begin{bmatrix} W&-WY^\intercal/2\\ -YW^\intercal/2&0\end{bmatrix}.$$
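As a sanity check (a quick numerical sketch; the dimensions $n=4$, $d=3$ and the random matrices are purely illustrative), the rewritten block form does reproduce $f$, even for non-symmetric $W$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
W = rng.standard_normal((n, n))   # deliberately not symmetric
X = rng.standard_normal((n, n))
Y = rng.standard_normal((d, n))

def frob(A, B):
    """Frobenius inner product <A, B> = sum_ij A_ij B_ij."""
    return np.sum(A * B)

# original definition: f(X, Y) = <W, X - Y^T Y>
f_direct = frob(W, X - Y.T @ Y)

# block rewriting: f = <[[X, Y^T], [Y, 0]], [[W, -W Y^T/2], [-Y W^T/2, 0]]>
A = np.block([[X, Y.T], [Y, np.zeros((d, d))]])
B = np.block([[W, -W @ Y.T / 2], [-Y @ W.T / 2, np.zeros((d, d))]])
f_block = frob(A, B)

print(np.isclose(f_direct, f_block))  # True
```

(The two agree because $Y^\intercal Y$ is symmetric, so $\langle W, Y^\intercal Y\rangle = \langle W^\intercal, Y^\intercal Y\rangle$.)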

If $Z=\begin{bmatrix} Z_1&Z_2^\intercal\\ Z_2 & Z_3 \end{bmatrix}$, then the derivative is equal to $Z$ when $W=Z_1$, $YW^\intercal=-2Z_2$ (meaning $Y^\intercal=-2W^+Z_2^\intercal$?), and $Z_3=0$.

Does this make any sense? How is this usually done?

Best Answer:

To talk about gradients, you need an inner product. You are using the Frobenius inner product on $\mathbb R^{n\times m}$. You need to first consider what inner product you are using on $\mathbb R^{n\times n}\times \mathbb R^{d\times n}$. A reasonable choice here is $\langle (X_1,Y_1) , (X_2,Y_2)\rangle =\langle X_1, X_2 \rangle + \langle Y_1, Y_2 \rangle$ (sum of the Frobenius inner products).

However, this seems like notational overkill here.

If you are just trying to find the derivatives with respect to $X$ and $Y$ separately, you do not need to go this route. Since we can write $f(X,Y) = f_1(X) + f_2(Y)$, it is easier to take the gradients with respect to $X$ and $Y$ separately.

Let $f_1(X) =\langle W, X\rangle$ and $f_2(Y) = -\langle W, Y^T Y\rangle$, so that $f = f_1 + f_2$ (note the minus sign carried over from $X - Y^T Y$).

Since $f_1$ is linear, the gradient with respect to $X$ is just $W$.

For $f_2$, we can expand explicitly to get $f_2(Y+H)= -\langle W, (Y^T +H^T) (Y+H)\rangle$, and we can write this as $f_2(Y+H)-f_2(Y) = L(H) - \langle W, H^T H\rangle$, where $L$ is the linear term from which we will extract the gradient (the quadratic remainder is $o(\|H\|)$, so it does not affect the gradient).

Some easy identities for the Frobenius inner product are $\langle A, B\rangle = \langle B, A\rangle = \langle A^T, B^T\rangle$, $\langle A, BC\rangle =\langle B^TA, C\rangle$.
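These identities are easy to verify numerically (random matrices with arbitrary but conformable shapes; the sizes below are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((5, 4))
D = rng.standard_normal((3, 5))  # so that D @ C has the same shape as A

def frob(P, Q):
    """Frobenius inner product <P, Q> = sum_ij P_ij Q_ij."""
    return np.sum(P * Q)

assert np.isclose(frob(A, B), frob(B, A))            # <A, B> = <B, A>
assert np.isclose(frob(A, B), frob(A.T, B.T))        # <A, B> = <A^T, B^T>
assert np.isclose(frob(A, D @ C), frob(D.T @ A, C))  # <A, BC> = <B^T A, C>
```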

Using these we get $L(H) = -\langle W, Y^TH +H^TY \rangle = -\langle YW, H\rangle - \langle YW^T, H\rangle = \langle -Y(W+W^T), H \rangle$, and so we see that the gradient with respect to $Y$ is $-Y(W+W^T)$.
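A finite-difference check of this gradient (a sketch with random data; the dimensions are illustrative, and $f_2$ carries the minus sign from $f = \langle W, X - Y^TY\rangle$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 3
W = rng.standard_normal((n, n))
Y = rng.standard_normal((d, n))

# f2(Y) = -<W, Y^T Y>
f2 = lambda Y_: -np.sum(W * (Y_.T @ Y_))

grad = -Y @ (W + W.T)  # claimed gradient, shape d x n (same as Y)

# central finite differences, entry by entry
eps = 1e-6
num = np.zeros_like(Y)
for i in range(d):
    for j in range(n):
        E = np.zeros_like(Y)
        E[i, j] = eps
        num[i, j] = (f2(Y + E) - f2(Y - E)) / (2 * eps)

print(np.allclose(grad, num, atol=1e-6))  # True
```

Since $f_2$ is quadratic in $Y$, the central difference is exact up to floating-point rounding.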

In particular, the only way the gradient with respect to $X$ can be zero is if $W=0$, and the gradient with respect to $Y$ is zero precisely when $Y(W+W^T)=0$.