Derivative for Masked Matrix Hadamard Multiplication


In deep learning, the following operation is common:

$$A = B\circ (C>0.2)$$

where $A,B,C\in \mathbb{R}^{n\times m}$, $\circ$ denotes the Hadamard (elementwise) product, and $C>0.2$ denotes the masked matrix whose entries equal those of $C$ where they exceed $0.2$ and are set to $0$ otherwise.

I want to know the partial derivative of $A$ with respect to $C$, formally $$\frac{\partial A}{\partial C}$$
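To make the masking operation concrete, here is a minimal sketch in plain Python (the matrices $B$, $C$ below are small illustrative values, not from any particular model):

```python
lam = 0.2  # the threshold from the question

B = [[1.0, -2.0], [0.5, 3.0]]
C = [[0.5, 0.1], [0.3, -0.4]]

# (C > lam): keep each element of C where it exceeds lam, zero it otherwise
C_masked = [[cij if cij > lam else 0.0 for cij in row] for row in C]

# A = B ∘ (C > lam): Hadamard (elementwise) product
A = [[bij * cij for bij, cij in zip(brow, crow)]
     for brow, crow in zip(B, C_masked)]
```

Only the entries of $C$ above the threshold survive; the rest zero out the corresponding entries of $B$.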

Best answer:

$\def\v{{\rm vec}}\def\d{{\rm diag}}\def\D{{\rm Diag}}\def\o{{\tt1}}\def\p{{\partial}}\def\grad#1#2{\frac{\p #1}{\p #2}}\def\hess#1#2#3{\frac{\p^2 #1}{\p #2\,\p #3^T}}\def\bb{\mathbb}$It's simpler to use the vec() operator and deal with a vector equation in $\,{\bb R}^{mn\times 1}$
$$a = b\circ (c>\lambda)$$
Make the following definitions.
$$\eqalign{ z &= c-\lambda\o \qquad&(\lambda\,{\rm is\,an\,arbitrary\,scalar}) \\ {\cal H}(z_k) &= \begin{cases}1\quad{\rm if}\quad z_k>0\\0\quad{\rm otherwise} \end{cases} \qquad&({\rm Heaviside\,step\,function}) \\ h &= {\cal H}(z) \qquad&({\rm apply\,the\,function\,elementwise}) \\ }$$
Write the problem in terms of the above, then calculate its differential and gradient. Note that $h$ is piecewise constant, so $dh=0$ wherever the gradient exists, i.e. wherever no component of $c$ equals $\lambda$.
$$\eqalign{ a &= b\circ h\circ c \\ da &= b\circ h\circ dc \;=\; \D(b\circ h)\; dc \\ \grad{a}{c} &= \D(b\circ h) \\\\ }$$
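As a numerical sanity check on $\frac{\partial a}{\partial c}={\rm Diag}(b\circ h)$, the plain-Python sketch below (small made-up vectors $b$, $c$, $dc$) compares a finite-difference directional derivative of $a=b\circ h\circ c$ against ${\rm Diag}(b\circ h)\,dc$; it assumes no entry of $c$ sits within the step size of the threshold $\lambda$, where the derivative is undefined:

```python
lam = 0.2
b = [1.0, -2.0, 0.5, 3.0]
c = [0.5, 0.1, 0.3, -0.4]   # no entry near lam

def f(cv):
    # a = b ∘ H(c - lam) ∘ c, applied elementwise
    return [bi * ci if ci > lam else 0.0 for bi, ci in zip(b, cv)]

h = [1.0 if ci > lam else 0.0 for ci in c]       # Heaviside mask
grad_diag = [bi * hi for bi, hi in zip(b, h)]    # diagonal of Diag(b ∘ h)

eps = 1e-6
dc = [0.3, -0.7, 1.1, 0.2]                       # arbitrary direction
a0 = f(c)
a1 = f([ci + eps * di for ci, di in zip(c, dc)])
fd = [(x1 - x0) / eps for x0, x1 in zip(a0, a1)]  # finite difference
pred = [g * di for g, di in zip(grad_diag, dc)]   # Diag(b ∘ h) · dc
```

Away from the threshold the map is linear in $c$, so the two directional derivatives agree to rounding error.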


Note that the quantity $G=\left(\grad{a}{c}\right)$ calculated above is an ${\bb R}^{mn\times mn}$ matrix, whereas the requested quantity $\Gamma=\left(\grad{A}{C}\right)\in{\bb R}^{m\times n\times m\times n}$ is a fourth-order tensor. The individual elements are identical $\big({\rm e.g.}\;\Gamma_{1111}=G_{11}\big);\,$ the tensor has simply been reshaped into a matrix.
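That reshape correspondence can be checked in plain Python (small made-up $G$ and $dC$; the hypothetical helper `vec_index` implements the column-major ordering that vec() and Julia's `reshape` use):

```python
m, n = 2, 3
mn = m * n

# made-up gradient matrix G (mn x mn) and perturbation dC (m x n)
G = [[0.1 * (r * mn + s) for s in range(mn)] for r in range(mn)]
dC = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]

def vec_index(i, j):
    # column-major position of entry (i, j) of an m x n matrix
    return i + j * m

def Gamma(i, j, k, l):
    # the fourth-order tensor is just G re-indexed
    return G[vec_index(i, j)][vec_index(k, l)]

# matrix form: da = G * vec(dC)
dc = [dC[i][j] for j in range(n) for i in range(m)]   # stack columns
da = [sum(G[r][s] * dc[s] for s in range(mn)) for r in range(mn)]

# tensor form: dA[i,j] = sum over k,l of Gamma[i,j,k,l] * dC[k,l]
dA = [[sum(Gamma(i, j, k, l) * dC[k][l] for k in range(m) for l in range(n))
       for j in range(n)] for i in range(m)]
```

The two contractions sum exactly the same products, so `vec(dA)` matches `da` entry for entry.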

The matrix can be reshaped back into a tensor, if that is the desired form. Personally, I use Julia and find the matrix form more convenient to work with, since I can use the regular built-in matrix-vector product instead of writing explicit for-loops.

Consider the matrix calculation

using LinearAlgebra   # for norm() below
m, n = 3, 4           # any concrete dimensions
G, dc = rand(m*n, m*n), rand(m*n)
da = G*dc

versus the tensor calculation

Γ,dC,dA = reshape(G,m,n,m,n), reshape(dc,m,n), zeros(m,n)
for i = 1:m
  for j = 1:n
    for k = 1:m
      for l = 1:n
        dA[i,j] += Γ[i,j,k,l] * dC[k,l]
      end
    end
  end
end

norm(vec(dA) - da)  # approx == machine epsilon

Yes, there are tensor packages available, but it's still awkward compared to working with vectors and matrices.