For the sake of simplicity the following notation $a_k := a(k)$ is assumed for time sequences.
A completely general discrete-time (DT), non-linear (NL), time-invariant (TI) dynamical system can be described by a system of vector difference equations, where $x(k)$ is the system state, $u(k)$ is the system input and $y(k)$ is the system output: $$\begin{cases} x(k+1) = f(x(k),u(k))\\ y(k) = g(x(k), u(k)) \end{cases} \quad\forall k\ge0$$ Here $f$ and $g$ are general non-linear functions.
The first equation closely resembles a general autonomous non-linear differential equation $\dot x(t)=f(x(t),u(t))$, where $u(t)$ is the source term.
Now, the same DT NL TI dynamical system could also be described through an input-output (IO) relationship involving several samples (current and past) of the input and the output: $$y(k) = h(y(k-1),y(k-2),\dots,y(k-n),u(k),u(k-1),\dots,u(k-m)), \quad\forall k\ge0$$ where, again, $h$ is a general non-linear function and $n,m$ are positive integers.
How can one prove that this input-output form is equivalent to the former state-space representation, at least when $x(0)=0$ (without worrying about the explicit relationship between $f,g$ and $h$)?
EDIT: And, at the very least, how can one be sure that the IO relationship involves only finitely many past samples and not all the past samples (since the initial time instant)?
One can do this by, for example, defining the state vector as follows:
$$ x(k) = \begin{bmatrix} y(k-1) \\ y(k-2) \\ \vdots \\ y(k-n) \\ u(k-1) \\ u(k-2) \\ \vdots \\ u(k-m) \end{bmatrix}. $$
The history starts at $k-1$ rather than at $k$: if $y(k)$ itself were a state, then $x(k+1)$ would contain $y(k+1)$, which by your second update formula depends on $u(k+1)$, and that is not available from $x(k)$ and $u(k)$.

With this choice, $g(x(k),u(k))$ is trivial: it is just your second update formula for $y(k)$, since all the required arguments of $h(\cdot)$ are present in $x(k)$ and $u(k)$.

One can then quite easily find $f(x(k),u(k))$, because it should describe $x(k+1)$. Assuming that $y(k)$ is a vector of length $l$, the top $l$ states of $x(k+1)$ are $y(k) = g(x(k),u(k))$. The remaining states are just a down shift of the already known states of $x(k)$, omitting $y(k-n)$ at the point where the history of $u$ starts, where $u(k)$ is inserted instead. After this the down shifting of the already known states of $x(k)$ continues until the end of the state vector, omitting $u(k-m)$.
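As a numerical sanity check, here is a minimal Python sketch of this kind of construction. It assumes scalar signals ($l=1$), zero pre-history (all samples before $k=0$ are zero, which corresponds to $x(0)=0$), and a state that stacks the $n$ previous outputs followed by the $m$ previous inputs, i.e. $x(k) = [y(k-1),\dots,y(k-n),u(k-1),\dots,u(k-m)]$. The names `simulate_io`, `make_state_space`, `simulate_ss` and the particular `h_example` are made up for illustration; any other $h$ with the same argument layout could be substituted.

```python
def simulate_io(h, u, n, m):
    """Run y(k) = h(y(k-1..k-n), u(k), u(k-1..k-m)) with zero pre-history."""
    y = []
    for k in range(len(u)):
        # Samples with a negative time index are taken as zero (assumption).
        past_y = [y[k - i] if k - i >= 0 else 0.0 for i in range(1, n + 1)]
        past_u = [u[k - j] if k - j >= 0 else 0.0 for j in range(1, m + 1)]
        y.append(h(past_y, u[k], past_u))
    return y


def make_state_space(h, n, m):
    """Build f and g for the state x(k) = [y(k-1..k-n), u(k-1..k-m)]."""
    def g(x, u_k):
        # Output map: h evaluated on the stored history and the current input.
        return h(x[:n], u_k, x[n:])

    def f(x, u_k):
        # State update: new output on top, shift the y-history down (dropping
        # y(k-n)), insert u(k), shift the u-history down (dropping u(k-m)).
        return [g(x, u_k)] + x[:n - 1] + [u_k] + x[n:n + m - 1]

    return f, g


def simulate_ss(f, g, u, n, m):
    """Run x(k+1) = f(x(k), u(k)), y(k) = g(x(k), u(k)) from x(0) = 0."""
    x = [0.0] * (n + m)
    y = []
    for u_k in u:
        y.append(g(x, u_k))
        x = f(x, u_k)
    return y


def h_example(past_y, u_k, past_u):
    # Hypothetical non-linear IO map with n = 2 and m = 1.
    return 0.5 * past_y[0] - 0.1 * past_y[1] ** 2 + u_k + 0.3 * past_u[0] * u_k


u = [1.0, 0.5, -0.2, 0.0, 0.8, -1.0]
f, g = make_state_space(h_example, n=2, m=1)
y_io = simulate_io(h_example, u, n=2, m=1)
y_ss = simulate_ss(f, g, u, n=2, m=1)
print(all(abs(a - b) < 1e-12 for a, b in zip(y_io, y_ss)))
```

Both simulations feed $h$ exactly the same arguments at every step, so the two output sequences should agree. This only illustrates the direction from the IO form to a state-space form; it says nothing about recovering $h$ from a given $f$ and $g$.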