Suppose I define two functions of $x$ in terms of a convex function $f$ with a unique minimum $x_0$:
$$f_1(x) = 1 \times f(x)$$
$$f_2(x) = 2 \times f(x)$$
Suppose I wanted to minimize each of these functions numerically.
Clearly, the minima of both occur at the same $x_0$.
But if I tried to do gradient descent, I would be performing the updates
$$x \gets x - \lambda \nabla f_k(x)$$
and therefore step sizes would be larger for $f_2$ than $f_1$.
To me, this doesn't make sense. The step sizes should not depend on the scaling factor!
It therefore seems to imply that gradient descent should be normalized such that the step sizes have constant magnitude regardless of the gradient's magnitude.
So what justification is there for letting the step size grow with the magnitude of the gradient, when scaling the original function changes the gradient but not the location of its extrema?
If your step size $\lambda$ is small enough, then the update $$x_{t+1} = x_t - \lambda \nabla f_k(x_t)$$ guarantees $f_k(x_{t+1}) \leq f_k(x_t)$, at least for the particular function you performed the gradient descent step on. So a $\lambda$ that is small enough for $f_1$ may not be small enough for $f_2$.
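A minimal sketch of this point, assuming the concrete choice $f(x) = x^2$ (so $x_0 = 0$) and an illustrative step size $\lambda = 0.6$: the same $\lambda$ that makes the iterates contract for $f_1$ makes them diverge for $f_2 = 2 f_1$.

```python
# Plain gradient descent: x <- x - lambda * grad(x), repeated.
def grad_descent(grad, x, lam, steps):
    for _ in range(steps):
        x = x - lam * grad(x)
    return x

grad_f1 = lambda x: 2 * x  # gradient of f1(x) = x**2
grad_f2 = lambda x: 4 * x  # gradient of f2(x) = 2 * x**2

lam = 0.6
# For f1: x <- (1 - 2*lam) * x = -0.2 * x, so |x| shrinks each step.
print(grad_descent(grad_f1, 1.0, lam, 20))  # close to 0
# For f2: x <- (1 - 4*lam) * x = -1.4 * x, so |x| grows each step.
print(grad_descent(grad_f2, 1.0, lam, 20))  # blows up
```

Both functions have their minimum at the same $x_0 = 0$, but the doubled gradient effectively doubles the step, pushing the $f_2$ iteration past the stability threshold $|1 - \lambda f''| < 1$.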
The normalization you're describing is closer in spirit to methods that account for curvature, such as Newton's method or conjugate gradient.
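To illustrate why a curvature-aware method fixes the scaling issue, here is a sketch of a single Newton step, $x \gets x - f''(x)^{-1} f'(x)$, again assuming $f(x) = x^2$ so that $f_k(x) = k\,x^2$ for $k = 1, 2$. Scaling $f$ by a constant multiplies both the gradient and the second derivative by that constant, so the step is unchanged.

```python
# One Newton step: divide the gradient by the second derivative.
def newton_step(grad, hess, x):
    return x - grad(x) / hess(x)

for k in (1.0, 2.0):
    g = lambda x, k=k: 2 * k * x  # gradient of k * x**2
    h = lambda x, k=k: 2 * k      # second derivative of k * x**2
    # The factor k cancels in g(x)/h(x), so both scalings take
    # the same step and land exactly on the minimum at 0.
    print(newton_step(g, h, 1.0))  # 0.0 for both k = 1 and k = 2
```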