Vectorization and transpose: how are $\text{vec}(W^T)$ and $\text{vec}(W)$ related?

While solving for a gradient, I ended up with a differential of the form $$ dT = (a^T \otimes b^T)\ \text{vec}[d[W]^T] + (b^T \otimes c^T)\ \text{vec}[d[W]] $$ and I am trying to find $\frac{\partial T}{\partial \text{vec}W}$. The second term isn't a big problem, since the vec and $d$ operators commute, but the first term involves a transpose and I am not sure how to handle it.

Note: Although I hope it isn't necessary, the actual quantity I am solving for is $\frac{\partial T}{\partial \text{vec}W}$ with $T$ defined below: $$ T = (y - f_1(W_1f_0(W_0x)))^TB_0^Tf_0(W_0x) $$ using elementwise differentiable functions $f_i$, non-square matrices $W_i$, and a static vector $y$.

Edit: Based on numerical experiments it seems that this relationship is true: $$ T = \underbrace{b^TW^Ta}_{T_0} + \underbrace{c^TWb}_{T_1}\\ dT_0 = (a^T \otimes b^T)\text{vec}[d[W]^T] \implies \nabla_{\text{vec} W} T_0 = (a^T \otimes b^T)\\ dT_1 = (b^T \otimes c^T)\text{vec}[d[W]] \implies \nabla_{\text{vec} W} T_1 = (c^T \otimes b^T) $$ But this doesn't make any sense to me. Why would the non-transposed version flip the order of the Kronecker product?

There are 1 best solutions below

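So I have learned that the matrix @amd referred to is called the commutation matrix. For a given $W \in \mathbb{R}^{n \times m}$ there exists a permutation matrix $K_{mn} \in \mathbb{R}^{mn \times mn}$ such that

$$ \text{vec}[W^T] = K_{mn}\,\text{vec}[W]\\ \text{vec}[W] = K_{mn}^T\,\text{vec}[W^T] $$ The transpose in the second line is needed because $K_{mn}$ is a permutation matrix, so $K_{mn}^{-1} = K_{mn}^T$, and it is not symmetric in general when $W$ is non-square.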
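As a quick numerical check of the two vec identities, here is a minimal NumPy sketch. The helper names `vec` and `commutation_matrix` are my own (not from any library), and `vec` is assumed to stack columns, i.e. column-major order.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization (column-major / Fortran order)."""
    return M.reshape(-1, order="F")

def commutation_matrix(n, m):
    """Hypothetical helper: permutation K with K @ vec(W) == vec(W.T) for any n-by-m W."""
    K = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(m):
            # W[i, j] sits at index i + j*n in vec(W) and at index j + i*m in vec(W.T)
            K[j + i * m, i + j * n] = 1.0
    return K

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.standard_normal((n, m))
K = commutation_matrix(n, m)

forward_ok = np.allclose(K @ vec(W), vec(W.T))    # vec(W^T) = K vec(W)
inverse_ok = np.allclose(K.T @ vec(W.T), vec(W))  # vec(W) = K^T vec(W^T)
symmetric = np.array_equal(K, K.T)                # False for this non-square shape

print(forward_ok, inverse_ok, symmetric)
```

For this non-square shape $K$ differs from $K^T$, which is why the inverse relation needs the transpose.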

As it turns out, the construction of $K_{mn}$ is the same one that lets you commute the factors of a Kronecker product. Writing $K_{ij}$ for the commutation matrix that acts on the vec of a $j \times i$ matrix (the same subscript convention as $K_{mn}$ above), for $A \in \mathbb{R}^{p\times q}$ and $B \in \mathbb{R}^{r\times s}$ there exist $K_{pr} \in \mathbb{R}^{pr \times pr}$ and $K_{sq} \in \mathbb{R}^{qs \times qs}$ such that: $$ K_{pr} (A \otimes B) K_{sq} = (B \otimes A) $$

In the particular case I was asking about (where the Kronecker product is between two row vectors) we have $A = a^T \in \mathbb{R}^{1\times n}$ and $B = b^T \in \mathbb{R}^{1\times m}$, i.e. $p = r = 1$, $q = n$, $s = m$, so that:

$$ K_{11} (a^T \otimes b^T) K_{mn} = (b^T \otimes a^T)\\ (a^T \otimes b^T) K_{mn} = (b^T \otimes a^T) $$ because $K_{11}$ is the one-by-one matrix/scalar $1$. Applied to the first term $T_0 = b^TW^Ta$, this results in the identity I found via experimentation: $$ \begin{align*} dT_0 &= (a^T \otimes b^T)d[\text{vec}[W^T]]\\ &= (a^T \otimes b^T)d[K_{mn}\text{vec}[W]]\\ &= (a^T \otimes b^T)K_{mn}d[\text{vec}[W]]\\ &= (b^T \otimes a^T)d[\text{vec}[W]]\\ \frac{\partial T_0}{\partial \text{vec}[W]} &= (b^T \otimes a^T) \end{align*} $$
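The Kronecker swap and the resulting derivative can be checked numerically. The sketch below is only an illustration under my own assumptions: `commutation_matrix(rows, cols)` is a hypothetical helper that takes the shape of the matrix being transposed, `vec` is column-major, and the test shapes are arbitrary.

```python
import numpy as np

def commutation_matrix(rows, cols):
    """Hypothetical helper: K with K @ vec(X) == vec(X.T) for rows-by-cols X (column-major vec)."""
    K = np.zeros((rows * cols, rows * cols))
    for i in range(rows):
        for j in range(cols):
            K[j + i * cols, i + j * rows] = 1.0
    return K

rng = np.random.default_rng(1)
n, m = 3, 4
a, b = rng.standard_normal(n), rng.standard_normal(m)
W = rng.standard_normal((n, m))

# Row-vector case: (a^T kron b^T) K == (b^T kron a^T), with K acting on vec of an n-by-m matrix
K = commutation_matrix(n, m)
swap_ok = np.allclose(np.kron(a, b) @ K, np.kron(b, a))

# General case: A is p-by-q, B is r-by-s; the left K acts on r-by-p, the right K on q-by-s
p, q, r, s = 2, 3, 4, 5
A, B = rng.standard_normal((p, q)), rng.standard_normal((r, s))
general_ok = np.allclose(
    commutation_matrix(r, p) @ np.kron(A, B) @ commutation_matrix(q, s),
    np.kron(B, A),
)

# Central differences of T0 = b^T W^T a with respect to each entry of vec(W)
def T0(M):
    return b @ M.T @ a

eps = 1e-6
grad_fd = np.zeros(n * m)
for k in range(n * m):
    step = np.zeros(n * m)
    step[k] = eps
    dW = step.reshape((n, m), order="F")  # perturb the k-th entry of vec(W)
    grad_fd[k] = (T0(W + dW) - T0(W - dW)) / (2 * eps)
grad_ok = np.allclose(grad_fd, np.kron(b, a), atol=1e-6)

print(swap_ok, general_ok, grad_ok)
```

Since $T_0$ is linear in $W$, the central difference is exact up to rounding, so the comparison against $b^T \otimes a^T$ is a fair test.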

The other thing that confused me is that $\nabla_{\text{vec} W} T = \frac{\partial T}{\partial \text{vec} W^T}$, not the derivative w.r.t. $\text{vec}[W]$ as I originally thought.
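One possible way this shows up in numerical experiments: the post does not say how the gradient was flattened, but if it was flattened in row-major order (NumPy's default), the result is the derivative with respect to $\text{vec}[W^T]$ rather than $\text{vec}[W]$, and the Kronecker factors appear swapped. A small sketch under that assumption, using $T = b^TW^Ta + c^TWb$ from the question:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 4
a, c = rng.standard_normal(n), rng.standard_normal(n)
b = rng.standard_normal(m)

# T = b^T W^T a + c^T W b is linear in W, with dT/dW = a b^T + c b^T (an n-by-m matrix)
G = np.outer(a, b) + np.outer(c, b)

col_major = G.reshape(-1, order="F")  # vec(G): derivative w.r.t. vec(W)
row_major = G.reshape(-1, order="C")  # vec(G^T): derivative w.r.t. vec(W^T)

# Column-major flattening gives (b kron a) + (b kron c)
col_ok = np.allclose(col_major, np.kron(b, a) + np.kron(b, c))
# Row-major flattening gives (a kron b) + (c kron b), the orders seen in the question
row_ok = np.allclose(row_major, np.kron(a, b) + np.kron(c, b))

print(col_ok, row_ok)
```

Under this reading, the orders $(a^T \otimes b^T)$ and $(c^T \otimes b^T)$ reported in the question correspond to the row-major flattening.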