Here is some intuition for Ito's formula.
The Taylor expansion of a function $f$ about a point $x$ is $$ f(y) = f(x) + f'(x)(y-x) + \frac12 f''(x)(y-x)^2 + \dots \,.$$
If you replace $y-x$ with $dx$ and $f(y) - f(x)$ with $df(x)$, then $$ df(x) = f'(x)dx + \frac12 f''(x) dx^2 + \dots \,.$$
If you keep only the first term, you have the formula for the differential, $df = f' dx$.
If you keep the first two terms, you have Ito's formula, $ df = f'dx + \frac12 f'' dx^2$.
Is there some explanation for why functions of stochastic processes need the second-derivative term when taking the differential of $f$? I know that we use the fact that "$dz^2 = dt$", where $z$ is a Brownian motion, but I don't fully understand that. I know that $\mathbb{E}[B_t^2] = t$; is that related? And why is the second-order term in the Taylor expansion zero for the regular differential?
Edit: on slide 8 of these lecture notes, we have the following. If $dX_t = a dt + b dB_t$ is an Ito process, then \begin{align*} (dX_t)^2 &= (adt + b dB_t)^2 \\ &= a^2 dt^2 + 2(adt)(bdB_t) + (bdB_t)^2 \\ &= b^2 dB_t^2 \,. \end{align*}
Why are the first two terms zero? (I also don't understand why the term $dB_t dt =0$, on page 10.)
Thomas Kojar has provided an answer and some references already, but here is an intuitive explanation, in order to stay in the same context/spirit as your question.
1) Why is $\mathrm{d}x^2 \sim 0$ in the standard case?
Recall that, in the standard case of a function of a non-stochastic variable, the differential $\mathrm{d}f(x)$ is a kind of "half-finished computation" of the derivative $f'(x)$: dividing the Taylor expansion by $\mathrm{d}x$ gives $$ \frac{\mathrm{d}f}{\mathrm{d}x} = f'(x) + \frac{1}{2}f''(x)\,\mathrm{d}x + \ldots, $$ and every term except the first vanishes as $\mathrm{d}x \rightarrow 0$. The ambiguity comes from the fact that the limit is usually already implicit in the "d" notation; here the notation is slightly abused by writing $\mathrm{d}f$ and $\mathrm{d}x$ before taking the limit.
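To make point 1 concrete, here is a minimal numerical sketch. The test function $f = \sin$ on $[0,1]$, the uniform grid, and the name `taylor_sums` are my own choices for illustration. It sums the first-order terms $f'(x)\,\mathrm{d}x$ and the second-order terms $\frac12 f''(x)\,\mathrm{d}x^2$ over the grid: the first sum approaches $f(1) - f(0)$, while the second shrinks in proportion to $\mathrm{d}x$.

```python
import math

def taylor_sums(n, a=0.0, b=1.0):
    """Sum the first- and second-order Taylor terms of sin over n equal steps of [a, b]."""
    dx = (b - a) / n
    first = 0.0
    second = 0.0
    for i in range(n):
        x = a + i * dx
        first += math.cos(x) * dx                  # f'(x) dx
        second += 0.5 * (-math.sin(x)) * dx ** 2   # (1/2) f''(x) dx^2
    return first, second

exact = math.sin(1.0) - math.sin(0.0)
for n in (10, 100, 1000, 10000):
    first, second = taylor_sums(n)
    # Both the first-order error and the total second-order contribution
    # are O(dx), so they vanish as the grid is refined.
    print(n, first - exact, second)
```

Each second-order term is of size $\mathrm{d}x^2$ and there are about $1/\mathrm{d}x$ of them, so their total is of size $\mathrm{d}x$ and disappears in the limit.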
2) How to treat $\mathrm{d}X_t^2$ in the stochastic case, and why is $\mathrm{d}B_t\mathrm{d}t \sim 0$?
In contrast, when the independent variable is random, namely $X_t$, some of the terms in $\mathrm{d}f(X_t)$ coming from $\mathrm{d}X_t^2$ cannot be ignored, because $\mathbb{E}[\mathrm{d}B_t^2] = \mathrm{d}t$ is of first order in $\mathrm{d}t$. Note that another abuse of notation is usually made here (in a way, stochastic calculus is full of formalized abuses of notation): the expectation is dropped and one simply writes $\mathrm{d}B_t^2 = \mathrm{d}t$.
As before, higher-order terms are considered negligible in the limit $\mathrm{d}t \rightarrow 0$, because they would vanish when computing a yet-to-be-formalized derivative $\frac{\mathrm{d}f(X_t)}{\mathrm{d}t}$ (with $\frac{\mathrm{d}B_t}{\mathrm{d}t}$ interpreted as a white noise in that case). Consequently, the $o(\mathrm{d}t)$ terms, i.e. all the terms that are superlinear in $\mathrm{d}t$, are (implicitly) cut from the initial Taylor expansion. From that point of view, $\mathrm{d}B_t$ is kept because $\mathrm{d}B_t \sim \mathcal{N}(0,\mathrm{d}t) = \mathcal{N}(0,1)\sqrt{\mathrm{d}t} \sim \mathrm{d}t^{1/2}$, and $\mathrm{d}B_t^2 \sim \mathrm{d}t$ is kept as well, but $\mathrm{d}B_t\mathrm{d}t$ is ruled out because $\mathrm{d}B_t\mathrm{d}t \sim \mathrm{d}t^{3/2} = o(\mathrm{d}t)$, and likewise $\mathrm{d}t^2 = o(\mathrm{d}t)$.
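The scaling argument above can be checked with a small simulation. This is only a sketch under my own assumptions: a uniform time grid on $[0,T]$, increments drawn as $\mathcal{N}(0,\mathrm{d}t)$, and a hypothetical helper `increment_sums`. It accumulates $\sum \mathrm{d}B^2$, $\sum \mathrm{d}B\,\mathrm{d}t$ and $\sum \mathrm{d}t^2$ along one path, together with the left-endpoint sum $\sum 2B\,\mathrm{d}B$ that appears in Itô's formula for $f(x) = x^2$.

```python
import math
import random

def increment_sums(n, T=1.0, seed=0):
    """Accumulate the second-order increment products along one simulated Brownian path."""
    rng = random.Random(seed)
    dt = T / n
    sqrt_dt = math.sqrt(dt)
    B = 0.0
    sum_dB2 = 0.0
    sum_dBdt = 0.0
    sum_dt2 = 0.0
    ito_sum = 0.0
    for _ in range(n):
        dB = rng.gauss(0.0, sqrt_dt)   # dB ~ N(0, dt)
        ito_sum += 2.0 * B * dB        # f'(B) dB with f(x) = x^2, left endpoint
        sum_dB2 += dB * dB             # each term ~ dt, total ~ T
        sum_dBdt += dB * dt            # each term ~ dt^(3/2), total -> 0
        sum_dt2 += dt * dt             # each term ~ dt^2, total = T * dt -> 0
        B += dB
    return {"B_T": B, "sum_dB2": sum_dB2, "sum_dBdt": sum_dBdt,
            "sum_dt2": sum_dt2, "ito_sum": ito_sum}

result = increment_sums(100_000)
print(result)
# Discrete Ito identity for f(x) = x^2: B_T^2 = sum 2 B dB + sum dB^2
print(result["B_T"] ** 2 - result["ito_sum"], result["sum_dB2"])
```

On a fine grid, `sum_dB2` should sit close to $T$ while the other two sums are tiny. The last line shows that $B_T^2 - \sum 2B\,\mathrm{d}B$ equals $\sum \mathrm{d}B^2$ up to rounding, which is the discrete counterpart of $\mathrm{d}(B_t^2) = 2B_t\,\mathrm{d}B_t + \mathrm{d}t$.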
Final remark
All the above developments are valid when the stochastic process $X_t$ is made of a deterministic component, represented by the drift term $a_t\mathrm{d}t$, and a random phenomenon modelled by a normal distribution, represented here by the Gaussian noise $b_t\mathrm{d}B_t$. When the random driver in question is not Gaussian, for example in the case of a Poisson process, you will need to adapt and rederive Itô's lemma, because the relation $\mathrm{d}B_t^2 \sim \mathrm{d}t$ no longer holds.
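As a rough illustration of why the Poisson case differs, the sketch below simulates a Poisson counting process $N_t$ from exponential inter-arrival times and sums the squared increments over a fine grid. The rate, the horizon and the helper name `poisson_increments` are illustrative assumptions. On a fine grid almost every increment is 0 or 1, so $\mathrm{d}N^2 = \mathrm{d}N$: the sum of squares tracks the (random) number of jumps rather than the elapsed time.

```python
import random

def poisson_increments(rate, T, n, seed=0):
    """Jump counts of a Poisson process with the given rate in n equal bins of [0, T]."""
    rng = random.Random(seed)
    counts = [0] * n
    t = rng.expovariate(rate)              # time of the first jump
    while t < T:
        bin_index = min(int(t / T * n), n - 1)
        counts[bin_index] += 1
        t += rng.expovariate(rate)         # exponential waiting time to the next jump
    return counts

counts = poisson_increments(rate=3.0, T=1.0, n=100_000)
n_T = sum(counts)                          # N_T, total number of jumps
sum_dN2 = sum(k * k for k in counts)       # sum of squared increments
print(n_T, sum_dN2)
```

The two printed numbers coincide unless two jumps happen to land in the same bin, which becomes increasingly unlikely as the grid is refined. So the rule "$\mathrm{d}B_t^2 = \mathrm{d}t$" is replaced by something like "$\mathrm{d}N_t^2 = \mathrm{d}N_t$", and the second-order term in the expansion has to be handled differently.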