Context: I'm trying to figure out how backpropagation works on a variational autoencoder mathematically by implementing the backprop algorithm by hand. I'm having trouble figuring out exactly how to backprop through the Kullback-Leibler divergence part of the loss. However, I think my question can be answered without any background on deep generative models. Here is the code for the loss function, in case you're curious.
Say I have a function $$KL=-.5(1 + l - m^2 - e^l)$$ $$l= Wx + c$$
where $x \in \mathbb{R}^b$, $W \in \mathbb{R}^{a \times b}$, $c \in \mathbb{R}^a$, $l \in \mathbb{R}^a$, $m \in \mathbb{R}^a$. Here $e^l$ means exponentiating the vector elementwise, "$1+$" means adding one to the vector elementwise, and $m^2$ means squaring the vector elementwise (np.power). Let's say I'm interested in $dKL/dW$, and $l$ is the only term that depends on $W$: $$dKL/dW = -.5x +.5e^{Wx+c}x$$
What should be the dimension of the final result, and how do I know? It seems like $-.5x \in \mathbb{R}^b$, while $.5e^{Wx+c}x \in \mathbb{R}^{a\times b}$? Is this right? If so, can I even add the terms together somehow?
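For concreteness, here is a minimal NumPy sketch of the setup I have in mind (the dimensions and random values are just placeholders):

```python
import numpy as np

a, b = 3, 4                              # placeholder dimensions
rng = np.random.default_rng(0)

W = rng.standard_normal((a, b))          # W in R^{a x b}
c = rng.standard_normal(a)               # c in R^a
x = rng.standard_normal(b)               # x in R^b
m = rng.standard_normal(a)               # m in R^a

l = W @ x + c                            # l in R^a
KL = -0.5 * (1 + l - m**2 - np.exp(l))   # all operations elementwise
print(KL.shape)                          # (3,), i.e. a vector in R^a
```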
EDIT: Am I doing this right?
Let $l = W_{1}{\rm Relu}(W_{0}x + b) + b_{1}$, and find $d\lambda/dW_{0}$. Writing $dRelu$ for $dRelu(W_{0}x + b)$: $$\eqalign{ dl &= W_1dRelu \circ dW_0\,x \cr d\lambda &= \alpha(dp - dl):1 \cr &= \alpha(p \circ dl - W_1dRelu \circ dW_0\,x):1 \cr &= \alpha(p \circ W_1dRelu \circ dW_0\,x - W_1dRelu \circ dW_0\,x):1 \cr &= \alpha(p \circ W_1dRelu - W_1dRelu) \circ dW_0\,x:1 \cr &= \alpha(p \circ W_1dRelu - W_1dRelu):dW_0\,x \cr &= \alpha(p \circ W_1dRelu - W_1dRelu)x^T:dW_0 \cr \frac{d\lambda}{dW_0} &= \alpha(p \circ W_1dRelu - W_1dRelu)x^T \cr }$$
I will use the notation $$\eqalign{ B &= C\circ A &\implies B_{ij} = C_{ij}A_{ij} \cr \beta &= C:A &\implies \beta = \sum_i\sum_j C_{ij}A_{ij} \cr }$$ to denote the elementwise/Hadamard and trace/Frobenius products, respectively. These products are applicable to vectors as well as matrices.
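For concreteness, here is what the two products look like in NumPy (a small illustration; the particular matrices are arbitrary):

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)
C = 2.0 * np.ones((2, 3))

B = C * A                                   # Hadamard: B_ij = C_ij * A_ij
beta = np.sum(C * A)                        # Frobenius: sum_ij C_ij * A_ij
assert np.isclose(beta, np.trace(C.T @ A))  # same as the trace form tr(C^T A)
```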
Given the following variables $$\eqalign{ l &= Wx+c &\implies dl = dW\,x \cr p &= \exp(l) &\implies dp = p\circ dl \cr \alpha &= \frac{1}{2} \cr KL &= \alpha(p+m\circ m-l-1) \cr \lambda &= 1:KL &\implies 1 \in{\mathbb R}^{a} \cr }$$ Note that the KL divergence should be a scalar. Looking at the GitHub code that you linked, you neglected to include the "torch.sum()" function in your problem statement.
Hence the scalar $\lambda$ is defined as the sum of the elements of your $(KL)$ vector via a Frobenius product with $1$, the vector of all ones.
Find the differential and gradient of the scalar divergence $\lambda$ $$\eqalign{ d\lambda &= \alpha(dp-dl):1 \cr &= \alpha(p\circ dl-1\circ dl):1 \cr &= \alpha(p-1):dl \cr &= \alpha(p-1):dW\,x \cr &= \alpha(p-1)x^T:dW \cr \frac{\partial\lambda}{\partial W} &= \alpha(p-1)x^T = \frac{1}{2}(e^l-1)x^T \cr }$$ So the gradient has the same shape as $W\in{\mathbb R}^{a\times b}$. This also answers your dimension question: the two terms combine into the single outer product $\frac{1}{2}(e^l-1)x^T$, rather than a vector plus a matrix.
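A quick finite-difference check of this gradient (a sketch; the dimensions and random data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 3, 4
W = rng.standard_normal((a, b))
c = rng.standard_normal(a)
x = rng.standard_normal(b)
m = rng.standard_normal(a)

def lam(W):
    l = W @ x + c
    return 0.5 * np.sum(np.exp(l) + m**2 - l - 1)   # lambda = 1 : KL

# analytic gradient: (1/2)(e^l - 1) x^T
l = W @ x + c
grad = 0.5 * np.outer(np.exp(l) - 1.0, x)

# central differences, one entry of W at a time
eps, num = 1e-6, np.zeros_like(W)
for i in range(a):
    for j in range(b):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (lam(W + E) - lam(W - E)) / (2 * eps)

print(np.abs(grad - num).max())   # ~1e-10; shapes agree with W in R^{a x b}
```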
A quick review of some properties of the Hadamard and Frobenius products might prove helpful.
Since the Frobenius product is equivalent to the trace $$A:B={\rm tr}(A^TB)$$ the cyclic and transpositional properties of the trace $$\eqalign{ {\rm tr}(A^TB) &= {\rm tr}(AB^T) \cr {\rm tr}(A^TBC) &= {\rm tr}(CA^TB) = {\rm tr}((AC^T)^TB) \cr }$$ give rise to similar properties in the Frobenius product $$\eqalign{ A:B &= A^T:B^T \cr A:BC &= AC^T:B \cr }$$ Another useful property is that the Hadamard and Frobenius products commute with themselves and each other $$\eqalign{ A:B &= B:A \cr A\circ B &= B\circ A \cr A:(B\circ C) &= (A\circ B):C \cr }$$ The matrix of all ones is the identity element for Hadamard products $$A\circ 1 = A$$
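These identities are easy to sanity-check numerically, e.g.:

```python
import numpy as np

rng = np.random.default_rng(2)
frob = lambda X, Y: np.sum(X * Y)   # X : Y

A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((5, 4))
assert np.isclose(frob(A, B @ C), frob(A @ C.T, B))   # A:BC = AC^T:B

X, Y, Z = rng.standard_normal((3, 3, 4))              # three (3,4) matrices
assert np.isclose(frob(X, Y * Z), frob(X * Y, Z))     # A:(B∘C) = (A∘B):C
```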
Update for the ReLU Problem
Define $$\eqalign{ z &= W_0x+b &\implies dz=dW_0\,x\cr h &= {\rm step}(z) &\implies H={\rm Diag}(h)\cr r &= {\rm relu}(z) &\implies dr=h\circ dz = H\,dz \cr l &= W_1r + b_1 &\implies dl=W_1dr\cr p &= \exp(l) &\implies dp=p\circ dl \cr }$$ Then $$\eqalign{ \lambda &= \alpha(p-l+m\circ m-1):1 \cr d\lambda &= \alpha(p-1):dl \cr &= \alpha(p-1):W_1dr \cr &= \alpha W_1^T(p-1):dr \cr &= \alpha W_1^T(p-1):H\,dz \cr &= \alpha HW_1^T(p-1):dW_0\,x \cr &= \alpha HW_1^T(p-1)x^T:dW_0 \cr \frac{\partial\lambda}{\partial W_0} &= \alpha HW_1^T(p-1)x^T \cr }$$ Replacing a Hadamard product with a vector by a matrix product with a diagonal matrix, $$h\circ v=Hv,$$ is a standard trick for simplifying equations like this. Note that $H^T=H$, which justifies moving $H$ to the other side of the Frobenius product above.
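And the same finite-difference check confirms this formula, too (again a sketch with arbitrary shapes; ${\rm step}$ is the elementwise Heaviside function):

```python
import numpy as np

rng = np.random.default_rng(3)
a, k, b = 3, 5, 4                 # l in R^a, hidden layer r in R^k, x in R^b
W0 = rng.standard_normal((k, b))
b0 = rng.standard_normal(k)
W1 = rng.standard_normal((a, k))
b1 = rng.standard_normal(a)
x = rng.standard_normal(b)
m = rng.standard_normal(a)

def lam(W0):
    z = W0 @ x + b0
    r = np.maximum(z, 0.0)                         # relu(z)
    l = W1 @ r + b1
    return 0.5 * np.sum(np.exp(l) + m**2 - l - 1)  # lambda = 1 : KL

# analytic gradient: alpha * H W1^T (p - 1) x^T, with H = Diag(step(z))
z = W0 @ x + b0
h = (z > 0).astype(float)                          # step(z)
p = np.exp(W1 @ np.maximum(z, 0.0) + b1)
grad = 0.5 * np.outer(h * (W1.T @ (p - 1.0)), x)   # H v == h ∘ v

eps, num = 1e-6, np.zeros_like(W0)
for i in range(k):
    for j in range(b):
        E = np.zeros_like(W0)
        E[i, j] = eps
        num[i, j] = (lam(W0 + E) - lam(W0 - E)) / (2 * eps)

print(np.abs(grad - num).max())   # small, so the formula checks out
```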