Let $f: \mathbb R^2 \to \mathbb R,\ (x,y) \mapsto x$.
The graph of $f(x,y)=x$ is the tilted plane $z=x$ in $\mathbb R^3$.
The gradient is $\nabla f(x,y) =\begin{pmatrix} 1 \\ 0 \end{pmatrix}$.
I'm trying to get a proper understanding here. I always thought of the gradient at a specific point $(x_0,y_0)$ as pointing in the direction of the steepest ascent.
I think I have a hard time grasping the phrase "in the direction of the steepest ascent". That's not literally a vector in 3D, right? It'd be a vector in the x-y-plane, right?
So when thinking about the gradient, I shouldn't really think about the graph but the codomain? Because I can't really make sense of it otherwise.


The gradient is a vector in 2D ($\nabla f=(\partial_x f, \partial_y f)$). Take a point $x$ in $\mathbb{R}^2$ with $\nabla f(x) \neq 0$, and take the vector $v:=\nabla f(x)/|\nabla f (x)|$. Then, among all points $x+u$ with $u \in \mathbb{R}^2$, $|u|=1$, the linearization of $f$ attains its maximum value at $x+v$.
That's what 'direction' (a unit vector) 'of steepest ascent' means: if you move from $x$ along this vector, the linearization of $f$ attains its maximum value.
So the gradient has to be 2D, since you add it to $x$, a point in 2D, and since it is the collection of two partial derivatives.
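As a short worked check of this claim (a sketch, assuming $\nabla f(x) \neq 0$ so that $v$ is defined), the linearized value of $f$ at $x+u$ is $f(x)+\nabla f(x)\cdot u$, and the Cauchy-Schwarz inequality bounds it for every unit vector $u$:

$$
f(x)+\nabla f(x)\cdot u \;\le\; f(x)+|\nabla f(x)|\,|u| \;=\; f(x)+|\nabla f(x)|,
\qquad\text{with equality exactly when } u=\frac{\nabla f(x)}{|\nabla f(x)|}=v .
$$

For the function in the question, $\nabla f\cdot u = u_1 \le 1$ for any unit vector $u=(u_1,u_2)$, with equality only at $u=(1,0)$, which is a vector in the x-y-plane and not a vector in 3D.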
By 'linearized $f$' I mean the function $g(y)=f(x)+\nabla f(x) \cdot (y-x)$, the best first-order approximation of $f$ near $x$. Luckily for you, in your case the linearization is the function itself. But say we had a weirder-looking $f$.
We use this function because we are considering a local property of $f$, namely the 'instantaneous' rate of change of $f$ at $x$ when moving in some direction. If we looked at the change of $f$ itself (not $g$) at points at distance $1$ from $x$, we'd lose control of the local behaviour of $f$ near $x$, since $f$ can behave very differently at distant points. By looking at the linearized version, we're sure we are looking at something that resembles $f$ at our point of interest and that carries that same local information out to distant points.
But why do we look at points at distance $1$ from $x$? Because this is how one defines a direction (a unit vector), but also because if you allowed $y$ to be as distant from $x$ as you want, you could get an ascent as big or as small as you want. So we limit ourselves to unit vectors to end up with a well-posed question.
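The argument above can also be checked numerically. The following is a minimal sketch, not a definitive implementation: the example function $f(x,y)=x^2+3y$, the base point, and the helper names `f`, `grad_f`, and `steepest_direction_by_sampling` are all assumptions made for illustration and do not come from the question. It samples unit vectors $u$ on the circle, evaluates the linearization $g(p+u)=f(p)+\nabla f(p)\cdot u$, and compares the best sampled direction with the normalized gradient.

```python
import numpy as np


def f(p):
    # Hypothetical example function (not from the question): f(x, y) = x^2 + 3y.
    x, y = p
    return x**2 + 3.0 * y


def grad_f(p):
    # Gradient of the example function, computed by hand: (2x, 3).
    x, y = p
    return np.array([2.0 * x, 3.0])


def steepest_direction_by_sampling(p, n=3600):
    # Sample n unit vectors u evenly spaced on the unit circle.
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    units = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Linearization of f at p, evaluated at p + u: g(p + u) = f(p) + grad f(p) . u
    values = f(p) + units @ grad_f(p)
    # Return the sampled unit vector where the linearization is largest.
    return units[np.argmax(values)]


p = np.array([1.0, 2.0])                    # assumed base point
v = grad_f(p) / np.linalg.norm(grad_f(p))   # normalized gradient
best = steepest_direction_by_sampling(p)

print("normalized gradient:   ", v)
print("best sampled direction:", best)
```

The two printed vectors should agree up to the angular resolution of the sampling. Both are vectors in the plane, which matches the point that the direction of steepest ascent lives in the domain of $f$.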