I am trying to understand, conceptually, the gradient descent update rule $$\theta_1 = \theta_0 - \alpha \nabla_{\theta} J(\theta_0)$$
where $J(\theta)$ is the function that is being minimized.
For as long as I can remember, I was taught that the derivative/gradient is the "ratio of a small change in $y$ to the small change in $x$ that produced it."
From that I get
$$\Delta y \approx \Delta x \frac{dy}{dx}$$
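To make this concrete, here is a toy numerical check (my own example, with $y = x^2$, not anything from the rule itself) that the approximation behaves the way I expect:

```python
# Numerical check of the linear approximation dy ≈ dx * dy/dx,
# using the hypothetical example function y = x^2 (so dy/dx = 2x).
def y(x):
    return x ** 2

def dy_dx(x):
    return 2 * x

x0 = 3.0
dx = 0.01                        # small change applied to x
actual_dy = y(x0 + dx) - y(x0)   # true change in y
approx_dy = dx * dy_dx(x0)       # predicted change: dx * (dy/dx)

print(actual_dy)  # roughly 0.0601
print(approx_dy)  # roughly 0.06
```

So far so good: the product of a change in $x$ with the derivative gives (approximately) a change in $y$.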
But the gradient descent rule seems to defy that definition. Following it, I interpret the term $$\alpha \nabla_{\theta}J(\theta_0)$$ as giving (approximately) a change in $J$.
Basically, $$J_1-J_0 = \Delta J \approx \alpha \nabla_{\theta}J(\theta_0)$$
If we follow this logic, the units are mismatched, because the expression $\alpha \nabla_{\theta}J(\theta_0)$, which by my reading carries the units of $J$, is subtracted from $\theta_0$. It is as if $J$ were oranges and $\theta$ were apples, and we wanted to relate oranges to apples: we would be subtracting the wrong quantities from each other.
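Here is my confusion in concrete form, using a made-up example $J(\theta) = \theta^2$ with $\alpha = 0.1$ (neither is from any particular problem):

```python
# Concrete version of my confusion, with the hypothetical example
# J(theta) = theta^2, so grad J = 2*theta.
def J(theta):
    return theta ** 2

def grad_J(theta):
    return 2 * theta

theta0 = 3.0
alpha = 0.1

step = alpha * grad_J(theta0)    # the term the rule subtracts from theta0
theta1 = theta0 - step           # the update rule as written
delta_J = J(theta1) - J(theta0)  # the actual change in J

print(step)     # roughly 0.6, yet it is treated as a change in theta
print(delta_J)  # the actual change in J is a different number entirely
```

The quantity being subtracted from $\theta_0$ does not equal the actual change in $J$, which is what leads me to the question below.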
So can anyone help me understand why this isn't the correct way of looking at the update rule? Wouldn't it make more sense, dimensionally, to write the rule as $$\theta_1 = \theta_0 - \alpha (\nabla_{\theta}J(\theta_0))^{-1}?$$