I am reading this amazing tutorial and so far everything has been clear. Unfortunately, this section doesn't make sense to me:
Why is the derivative not a diagonal matrix but a vector?
According to this page, tanh's derivative is a diagonal matrix. Tanh and max look really similar to me. The tutorial also makes it clear that elementwise binary operators have diagonal Jacobians.
And it makes sense: when I differentiate $\max(0, x_i)$ with respect to $x_j$, the result should be $0$ for $i \neq j$, right?
What am I missing?
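To make my question concrete, here is a minimal sketch (my own illustration, using a numerical central difference) confirming that the Jacobian of elementwise ReLU really is diagonal:

```python
import numpy as np

# The Jacobian of y_i = max(0, x_i) has entries dy_i/dx_j = 0 for i != j,
# so it is a diagonal matrix. Check numerically with central differences.
x = np.array([1.5, -0.3, 0.7])
relu = lambda v: np.maximum(0.0, v)

eps = 1e-6
n = x.size
J = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J[:, j] = (relu(x + e) - relu(x - e)) / (2 * eps)

print(J)
# Off-diagonal entries are zero; the diagonal holds 1 where x_i > 0.
```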
@KyleC has the right idea here. Given the functions you're looking at, I suspect you're studying neural networks. These are often implemented with libraries such as numpy to take advantage of a multidimensional generalization of SIMD vectorization (see also here). From that perspective, a matrix is a vector of vectors (typically a column vector of row vectors, in my experience, though that's not the only option).
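In practice, storing only the diagonal as a vector loses nothing, because multiplying by a diagonal matrix is the same as an elementwise product. A small sketch of this equivalence (names and values are mine, for illustration):

```python
import numpy as np

# Multiplying a gradient g by diag(d) is identical to the elementwise
# product d * g, which is why implementations keep only the vector d.
x = np.array([1.5, -0.3, 0.7])
d = (x > 0).astype(float)       # diagonal of the ReLU Jacobian at x
g = np.array([0.2, -1.0, 0.5])  # some hypothetical upstream gradient

full = np.diag(d) @ g   # explicit diagonal-matrix multiply
vec = d * g             # elementwise (SIMD-friendly) form
print(np.allclose(full, vec))  # → True
```

The vector form avoids materializing an $n \times n$ matrix that is mostly zeros, which matters when $n$ is the width of a network layer.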