The most helpful answer I could find was Jonathan's answer here, and I decided not to comment and bump a literally 6 years old thread for clarification on the explanation, and also because my question, I believe, is not a direct duplicate, because it asks for a question following this, and because it can't be contained by the word limit.
His explanation, in "Why is gradient the direction of steepest ascent?" is this:
Consider a Taylor expansion of this function, $$f({\bf r}+{\bf\delta r})=f({\bf r})+(\nabla f)\cdot{\bf\delta r}+\ldots$$ The linear correction term $(\nabla f)\cdot{\bf\delta r}$ is maximized when ${\bf\delta r}$ is in the direction of $\nabla f$.
I find this is a very graceful answer, but I have one confusion.
For the linear term, it is implied that the derivative for $f(\mathbf r + \mathbf{\delta r})$ is $\mathbf{\nabla f}$, and I can't seem to figure out why this is. Also it seems clear that a Taylor series for a vector-valued function seems to replace multiplication by what seems like its vector analog, the dot product, for which I have no reliable, rigorous understanding as to why other than it being sort of what I would assume it would be if one created a Taylor series for a vector-valued function.
Without Taylor series or polynomials, it follows directly from the directional derivative formula for a differentiable function. The directional derivative (instantaneous rate of change) of $f$ at $\mathbf a$ in the direction of a unit vector $\mathbf v$ is given by $$D_{\mathbf v}f(\mathbf a) = \nabla f(\mathbf a)\cdot\mathbf v,$$ and so you get the maximum rate of change when you move in the direction of $\nabla f(\mathbf a)$ and a zero rate of change when you move orthogonal (perpendicular) to $\mathbf a$. (This is why the gradient vector gives the normal vector to level sets of $f$.)