Let's say I have some function $f(x, y)$ whose gradient at $(x_0, y_0)$ is $$\nabla f(x_0, y_0) = \langle 4, 1\rangle$$
In this 3Blue1Brown video, Grant says something akin to
the change in $f$ is $4\times$ more sensitive to changes in $x_0$ than it is to changes in $y_0$.
when discussing the gradient of a cost function in his explanation of neural networks. Why does this hold?
It seems intuitive to me that changing $x_0$ would result in a larger change in $f$ than changing $y_0$, since the gradient more closely lines up with the $x$-axis, but why is it exactly $4\times$?
The assertion$$\nabla f(x_0,y_0)=(4,1)$$means that, near $(x_0,y_0)$, $f(x,y)$ behaves like the linear approximation $f(x_0,y_0)+4(x-x_0)+(y-y_0)$. So, near $(x_0,y_0)$, a small change in the value of $x$ gets multiplied by approximately $4$, whereas a small change in the value of $y$ produces a change in $f$ of approximately the same size. So, yes, near $(x_0,y_0)$, $f$ is $4$ times more sensitive to changes in $x$ than to changes in $y$.
This would still hold if $\nabla f(x_0,y_0)=(4t,t)$, for some $t\ne0$.
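If it helps to see this numerically, here is a minimal sketch using a hypothetical function chosen so that its gradient at $(1,0)$ is exactly $(4,1)$, namely $f(x,y)=2x^2+xy$ (this specific function is my choice for illustration, not from the question):

```python
# Numerical check of the "4 times more sensitive" claim, assuming
# the hypothetical function f(x, y) = 2*x**2 + x*y, whose gradient
# at (1, 0) is (4, 1).

def f(x, y):
    return 2 * x**2 + x * y

x0, y0 = 1.0, 0.0
h = 1e-6  # a small nudge

# Change in f per unit nudge in x vs. per unit nudge in y
df_dx = (f(x0 + h, y0) - f(x0, y0)) / h   # ≈ 4
df_dy = (f(x0, y0 + h) - f(x0, y0)) / h   # ≈ 1

print(df_dx, df_dy, df_dx / df_dy)        # ratio ≈ 4
```

The same nudge $h$ applied to $x$ changes $f$ by about $4h$, while applied to $y$ it changes $f$ by about $h$, which is exactly the factor-of-$4$ sensitivity in the quote.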