Is the derivative of max(0, x) a diagonal matrix or a vector?


I am reading this amazing tutorial and so far everything has been clear and good. Unfortunately, there is this section which doesn't make sense to me: [screenshot of the relevant section of the tutorial]. Why is the derivative not a diagonal matrix but a vector? According to this page, tanh's derivative is a diagonal matrix, and tanh and max look really similar to me. The tutorial also makes it clear that elementwise binary operators have diagonal Jacobians.

And it makes sense: when I differentiate $\max(0, x_i)$ w.r.t. $x_j$ for $i \neq j$, the result should be $0$, right?

What am I missing?

2 Answers


@KyleC has the right idea here. Given the functions you're looking at, I suspect you're studying neural nets. These are often implemented with libraries such as numpy, which provide a multidimensional generalization of SIMD vectorization (see also here). From that perspective, a matrix is a vector of vectors (typically a column vector of row vectors, in my experience, though that's not the only option).
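To make the "vector of vectors" view concrete, here is a minimal numpy sketch (the array values are illustrative): indexing a 2-D array with one subscript yields a whole row vector, and elementwise operations apply to every entry at once.

```python
import numpy as np

# A 2x3 matrix stored as an array of row vectors.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# A single subscript returns an entire row vector.
row0 = A[0]  # the row vector [1., 2., 3.]

# Elementwise (vectorized) operations hit every entry at once --
# the SIMD-style behavior mentioned above.
shifted_relu = np.maximum(0.0, A - 3.0)
```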


You're not missing anything, you're just noticing how sloppy the math is in your current field of study.

When operating on a vector argument, functions are applied element-wise. The differential of such a function is given by $$f = f(x) \quad\implies\quad df = f'(x)\odot dx$$ where $\odot$ denotes the elementwise/Hadamard product and $f'(x)$ is the ordinary scalar derivative, which is also applied element-wise.
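The differential formula $df = f'(x)\odot dx$ can be checked numerically; here is a small sketch using $\tanh$ (whose scalar derivative is $1-\tanh^2$), where numpy's elementwise `*` plays the role of the Hadamard product:

```python
import numpy as np

# Elementwise function and its scalar derivative, applied componentwise.
f = np.tanh
fp = lambda x: 1.0 - np.tanh(x)**2  # d/dx tanh(x) = 1 - tanh^2(x)

x = np.array([0.5, -1.0, 2.0])
dx = 1e-6 * np.array([1.0, -2.0, 0.5])  # a small perturbation

# df = f'(x) (Hadamard) dx  -- numpy's `*` is the elementwise product.
df_formula = fp(x) * dx

# Compare with the actual change in f; they agree to first order in dx.
df_actual = f(x + dx) - f(x)
```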

The Hadamard product between two vectors can always be eliminated by converting one of the vectors into a diagonal matrix, e.g. $$\eqalign{ a\odot b = Ab \quad\Longleftarrow\quad A = {\rm Diag}(a) }$$ Eliminating the Hadamard product from the differential yields the gradient as $$\eqalign{ \frac{\partial f}{\partial x} &= F' = {\rm Diag}\big(f'(x)\big) \\ }$$ These ideas apply not just to $\,\tanh(x)\,$ but to any elementwise function, including $\,\max(0,x)\;-$ also known as $\,\operatorname{ReLU}(x).$
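A short sketch of this for ReLU (the convention of taking the derivative to be $0$ at $x=0$ is an assumption, since the derivative is undefined there): the Jacobian is ${\rm Diag}\big(f'(x)\big)$, but in code one usually keeps only the diagonal vector, because ${\rm Diag}(a)\,b = a\odot b$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    # Scalar derivative applied elementwise: 1 where x > 0, else 0.
    # (The derivative at exactly 0 is undefined; 0 is a common convention.)
    return (x > 0).astype(float)

x = np.array([1.5, -0.3, 0.7, -2.0])

# The full Jacobian is the diagonal matrix Diag(f'(x)) ...
J = np.diag(relu_prime(x))

# ... but Diag(a) @ b equals the Hadamard product a * b, so storing
# only the diagonal vector suffices.
dx = np.array([0.1, 0.2, 0.3, 0.4])
jacobian_vector_product = J @ dx
hadamard_product = relu_prime(x) * dx
```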


I notice that some of the comments mention broadcasting to explain/excuse the sloppy mathematics that afflicts the field of neural nets/machine learning. But broadcasting is something different.

Broadcasting simply pads the dimensions of a scalar/vector/matrix/tensor via repeated dyadic multiplication with all-ones vectors. For example $$\eqalign{ &A\in {\mathbb R}^{m\times n} \qquad &v\in {\mathbb R}^{m\times 1} \qquad {\tt1}\in {\mathbb R}^{n\times 1} \\ &A\odot v \qquad&\big({\rm incompatible}\big) \\ &A \odot (v{\tt1}^T) \qquad&\big({\rm compatible\,via\,broadcast}\big) \\ }$$ Broadcasting works for simple multiplication and division, but is worthless (and confusing) when calculating gradients and Jacobians.
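The padding described above can be verified in numpy (the shapes mirror the equations, with $m=2$, $n=3$ chosen for illustration): the explicit dyadic product $v{\tt1}^T$ gives exactly what numpy's implicit broadcasting computes.

```python
import numpy as np

m, n = 2, 3
A = np.arange(6.0).reshape(m, n)  # shape (m, n)
v = np.array([[10.0], [20.0]])    # shape (m, 1), a column vector
ones = np.ones((n, 1))            # the all-ones vector, shape (n, 1)

# Explicit padding via the dyadic product v @ ones.T, shape (m, n):
explicit = A * (v @ ones.T)

# numpy's broadcasting performs the same padding implicitly:
broadcast = A * v
```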