Let $y$ and $x$ be $n$-dimensional vectors related by $y = f(x)$, and let $L$ be a differentiable loss function. By the chain rule of calculus, $\nabla_x L = \left(\frac{\partial y}{\partial x}\right)^{\top} \nabla_y L$, which takes $O(n^2)$ computational time in general, since it requires a matrix-vector multiplication.
Show that if $f(x) = \sigma(x)$ or $f(x) = \mathrm{Softmax}(x)$, the above matrix-vector multiplication can be simplified to an $O(n)$ operation. Note that here the sigmoid is applied elementwise to a vector input $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$: $\sigma(x) = (\sigma(x_1), \ldots, \sigma(x_n))$.
Well, for the sigmoid, if $i \neq j$ then $\frac{\partial \sigma(x_i)}{\partial x_j} = 0$, so the Jacobian is sparse: only the $n$ diagonal entries are non-zero, with $\frac{\partial \sigma(x_i)}{\partial x_i} = \sigma(x_i)(1 - \sigma(x_i))$. The matrix-vector product therefore reduces to an elementwise product, $\nabla_x L = \sigma(x) \odot (1 - \sigma(x)) \odot \nabla_y L$, which costs $O(n)$. For the softmax, writing $s = \mathrm{Softmax}(x)$, the Jacobian is $\mathrm{diag}(s) - s s^{\top}$, so $\left(\frac{\partial y}{\partial x}\right)^{\top} \nabla_y L = s \odot \nabla_y L - s\,(s^{\top} \nabla_y L)$: one elementwise product and one dot product, again $O(n)$.
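The $O(n)$ shortcuts above can be checked numerically. The sketch below (a minimal illustration; the function names and test vectors are my own, not from the problem) implements both vector-Jacobian products and verifies them against the explicit $O(n^2)$ Jacobian:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_vjp(x, grad_y):
    """O(n) vector-Jacobian product for the elementwise sigmoid:
    the Jacobian is diagonal, so J^T v is just an elementwise product."""
    s = sigmoid(x)
    return s * (1.0 - s) * grad_y

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def softmax_vjp(x, grad_y):
    """O(n) vector-Jacobian product for softmax:
    J = diag(s) - s s^T (symmetric), so J^T v = s * v - s * (s . v)."""
    s = softmax(x)
    return s * grad_y - s * np.dot(s, grad_y)

# Sanity check against the explicit O(n^2) Jacobians (illustrative vectors)
x = np.array([0.5, -1.0, 2.0])
v = np.array([1.0, 2.0, 3.0])

s = sigmoid(x)
J_sig = np.diag(s * (1.0 - s))           # full n x n Jacobian of sigmoid
assert np.allclose(J_sig.T @ v, sigmoid_vjp(x, v))

p = softmax(x)
J_soft = np.diag(p) - np.outer(p, p)     # full n x n Jacobian of softmax
assert np.allclose(J_soft.T @ v, softmax_vjp(x, v))
```

Note that neither shortcut ever materializes the $n \times n$ Jacobian; both use only $O(n)$ memory and time.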