I am trying to understand the definition of Besov spaces. With such a complicated definition, I wonder what the motivation behind them is and why they are so often used in PDE. What advantage do they give over Sobolev spaces?
Are there any nice (hopefully short) references that introduce them?
Aside from the pointwise characterization (i.e. the Sobolev–Slobodeckij spaces) mentioned in the comments, I believe the best motivation for a norm that requires the complicated Littlewood-Paley decomposition is the theory of paraproducts.
Think about the Leibniz rule on Sobolev spaces $$\|fg\|_{W^{k,r}}\lesssim\|f\|_{W^{k,p}}\|g\|_{L^q}+\|f\|_{L^p}\|g\|_{W^{k,q}},\quad\frac1p+\frac1q=\frac1r.$$ To make it more useful in PDEs, we want a concrete decomposition $fg=T_1(f,g)+T_2(f,g)$ such that $T_1:W^{k,p}\times L^q\to W^{k,r}$ and $T_2:L^p\times W^{k,q}\to W^{k,r}$ are bounded bilinear operators.
Roughly speaking, $D^k T_1(f,g)\approx T_1(D^kf,g)$, so $T_1$ captures the "high frequency" part of $f$. Similarly, $D^k T_2(f,g)\approx T_2(f,D^kg)$.
To capture this idea one may want to use the Littlewood-Paley decompositions $f=\sum_jf_j$ and $g=\sum_kg_k$ (where $j,k\in\mathbb Z$ or $j,k\ge0$ depending on the context). In terms of these decompositions, $T_i(f,g)=\sum_{(j,k)\in\Lambda_i}f_jg_k$, where $\Lambda_1$ and $\Lambda_2$ form a partition of the index set for $(j,k)$.
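For concreteness, here is one possible choice of the partition (an illustrative assumption on my part, not the only option). Write $S_jg=\sum_{k\le j}g_k$ for the low-frequency partial sums and take $\Lambda_1=\{(j,k):k\le j\}$, $\Lambda_2=\{(j,k):k>j\}$. Then
$$T_1(f,g)=\sum_j f_j\,S_jg,\qquad T_2(f,g)=\sum_k (S_{k-1}f)\,g_k,\qquad T_1(f,g)+T_2(f,g)=\sum_{j,k}f_jg_k=fg.$$
In each term of $T_1$ the product $f_j\,S_jg$ has frequencies of size at most about $2^j$, so differentiating it costs no more than differentiating $f_j$ itself; this is the sense in which the derivatives "fall on $f$". Bony's decomposition refines this two-piece splitting into three pieces by separating out the near-diagonal terms $|j-k|\le1$ as a remainder.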
In general this also works for Sobolev spaces, which are special cases of Triebel-Lizorkin spaces. But for the paraproduct decomposition and its estimates, the Besov spaces should be the best starting point.
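To see why, it may help to put the two norms side by side. This is a sketch in the inhomogeneous setting $j\ge0$, with $f_j$ the Littlewood-Paley pieces of $f$, and with $q$ below denoting the summability index of the Besov space rather than the Hölder exponent used earlier:
$$\|f\|_{B^s_{p,q}}=\Big(\sum_{j\ge0}\big(2^{js}\|f_j\|_{L^p}\big)^q\Big)^{1/q},\qquad \|f\|_{W^{s,p}}\approx\Big\|\Big(\sum_{j\ge0}2^{2js}|f_j|^2\Big)^{1/2}\Big\|_{L^p}\quad(1<p<\infty).$$
Since $f_j$ lives at frequencies of size about $2^j$ for $j\ge1$, the factor $2^{js}$ plays the role of $s$ derivatives applied to that piece. In the Besov norm the $L^p$ norm is taken piece by piece before summing in $j$, so an estimate for a sum like $\sum_{(j,k)\in\Lambda_i}f_jg_k$ reduces to Hölder's inequality on each product followed by a summation over the indices. In the Sobolev (Triebel-Lizorkin) norm the sum sits inside the $L^p$ norm, and one needs square-function or maximal-function arguments instead.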
I would recommend the book Fourier Analysis and Nonlinear Partial Differential Equations by Bahouri, Chemin and Danchin.