I'm trying to understand the essential difference between two common modes of convergence of random processes: weak convergence of the finite-dimensional distributions (fdds) and convergence in distribution in some function space (for example, $\mathcal D[0,1]$ or $\mathcal C[0,1]$).
Suppose that $\xi_1,\xi_2,\ldots$ are i.i.d. random variables with zero mean and unit variance. Define $$ W_n(t)=\frac1{\sqrt n}\sum_{k=1}^{\lfloor nt\rfloor}\xi_k $$ for $n\ge1$ and $t\in[0,1]$, and let $W$ be the standard Wiener process.
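To make the setup concrete, here is a small numerical sketch (Rademacher steps $\xi_k=\pm1$ are just one admissible choice, and the sample sizes and seed are arbitrary): at $t=1$ the scaling gives $W_n(1)=\frac1{\sqrt n}\sum_{k=1}^n\xi_k$, which by the CLT should look approximately standard normal.

```python
import numpy as np

# Illustrative sketch: sample W_n(1) = n^{-1/2} * sum_{k<=n} xi_k many times,
# using Rademacher steps (E xi = 0, Var xi = 1), and check that the empirical
# mean and variance match the standard normal limit suggested by the CLT.
rng = np.random.default_rng(0)          # arbitrary seed
n, trials = 1000, 10000                 # arbitrary sizes
xi = rng.choice([-1.0, 1.0], size=(trials, n))   # i.i.d. +-1 steps
samples = xi.sum(axis=1) / np.sqrt(n)            # one W_n(1) per row
print(round(samples.mean(), 2), round(samples.var(), 2))  # close to 0 and 1
```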
We know that the fdds of $\{W_n:n\ge1\}$ converge weakly to the fdds of $W$. Furthermore, by Donsker's theorem, the random functions $\{W_n:n\ge1\}$, viewed as random variables taking values in the Skorokhod space $\mathcal D[0,1]$, converge in distribution to $W$. Since $\mathcal D[0,1]$ is complete and separable, convergence in distribution is equivalent to weak convergence of the fdds together with tightness of the corresponding probability measures. In particular, convergence in distribution in $\mathcal D[0,1]$ implies weak convergence of the fdds, so it is the stronger result.
It seems that I understand the mathematical meaning of these two types of convergence and the fact that convergence in distribution in $\mathcal D[0,1]$ implies the weak convergence of the fdds. But are there any reasons to prove the convergence in distribution in some function space apart from the fact that this is a stronger result? What is the difference between convergence in distribution in $\mathcal C[0,1]$ and $\mathcal D[0,1]$? Why don't we investigate convergence in $L_2[0,1]$, for example? What are the reasons to choose a particular function space?
Some examples would be really great and any help is much appreciated!
There are a lot of questions here, so I may have missed some of them.
$\textbf{Why study convergence on $\mathcal{D}[0,1]$ rather than $L^2[0,1]$?}$
Let me start by mentioning that people do study convergence of stochastic processes on $L^2[0,1]$.
Often, we are interested in the convergence of certain observables of a stochastic process, rather than just the convergence of the process itself. For example, suppose that $X_n \Longrightarrow X$ in $L^2[0,1]$ and we are interested in the supremum of $X$. Is it true that $\sup_{0 \leq t \leq 1} X_n (t) \Longrightarrow \sup_{0 \leq t \leq 1} X(t)$? Clearly not; the left- and right-hand sides are not even well-defined in the $L^2[0,1]$ sense (because $L^2$ functions are only defined up to sets of measure zero). If we are interested in studying the convergence of the supremum of a stochastic process, we then need to work in another space.
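On a space where the supremum functional does make sense, one can check this kind of statement numerically. A hedged sketch (Rademacher steps, arbitrary sample sizes and seed): the sup of the rescaled random walk should converge in distribution to $\sup_{0\le t\le 1} B(t)$, which by the reflection principle has the law of $|N(0,1)|$ with mean $\sqrt{2/\pi}\approx 0.798$.

```python
import numpy as np

# Monte Carlo check: sup of the rescaled random walk vs. sup of Brownian
# motion. By the reflection principle, sup_{0<=t<=1} B(t) =_d |N(0,1)|,
# whose mean is sqrt(2/pi). The walk slightly underestimates the sup
# (discretization), but the means should be close.
rng = np.random.default_rng(1)          # arbitrary seed
n, trials = 1000, 10000                 # arbitrary sizes
xi = rng.choice([-1.0, 1.0], size=(trials, n))
paths = np.cumsum(xi, axis=1) / np.sqrt(n)       # S^(n) sampled at k/n
sup_n = np.maximum(paths.max(axis=1), 0.0)       # include t = 0, where the path is 0
print(round(sup_n.mean(), 2), round(np.sqrt(2 / np.pi), 2))  # both close to 0.8
```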
The key reason to work with the Skorokhod $J_1$ metric (or, rather, a slight modification of it) on $\mathcal{D}[0,1]$ is that it turns the space of right-continuous functions with left limits on $[0,1]$ into a complete, separable metric space whose Borel $\sigma$-algebra is generated by the coordinate projections (Ethier and Kurtz, Proposition 3.7.1), and in which most (but definitely not all) functionals we are interested in are continuous. For example (Ethier and Kurtz, Exercise 3.11.26), the following maps from $\mathcal{D}[0,1]$ to $\mathcal{D}[0,1]$ are all continuous in this topology:
1.) $x \in \mathcal{D}[0,1] \mapsto (t \mapsto \sup_{0 \leq s \leq t} x(s)) \in \mathcal{D}[0,1]$
2.) $x \in \mathcal{D}[0,1] \mapsto (t \mapsto \inf_{0 \leq s \leq t} x(s)) \in \mathcal{D}[0,1]$
3.) $x \in \mathcal{D}[0,1] \mapsto (t \mapsto \int_0^t x(s) ds) \in \mathcal{D}[0,1]$
4.) $x \in \mathcal{D}[0,1] \mapsto (t \mapsto \sup_{0 \leq s \leq t} (x(s) - x(s-))) \in \mathcal{D}[0,1]$
$\textbf{What is the difference between convergence in $\mathcal{D}[0,1]$ and $\mathcal{C}[0,1]$?}$
$\mathcal{D}[0,1]$ is a bigger space than $\mathcal{C}[0,1]$ and sometimes we want to work with processes that have jumps (like the Poisson process). It is not terribly hard to show, however, that if you equip $\mathcal{C}[0,1]$ with the Skorokhod metric, then you get the usual topology of uniform convergence on $\mathcal{C}[0,1]$. Skorokhod mentions this just before defining the topology in his 1956 paper "Limit theorems for stochastic processes" for example. We can actually say a little bit more.
If you have a sequence of variables $X_n \Longrightarrow X$ in $\mathcal{D}[0,\infty)$, then $X$ is a.s. continuous if and only if $\int_0^\infty e^{-s}(1 \wedge\sup_{0 \leq r \leq s} |X_n(r) - X_n(r-)|)ds \Longrightarrow 0$ (Ethier and Kurtz Theorem 3.10.2). I switched to $\mathcal{D}[0,\infty)$ here to avoid problems at the endpoint of the interval, which are a real pain when working on the Skorokhod space on a finite interval. One can often show that for all $s$, $|X_n(s) - X_n(s-)| \leq C_n$ with $C_n \to 0$, which gives an accessible sufficient condition for convergence to a continuous process. For example, if $N(t)$ is a rate one Poisson process, then in order to show that $N^{(n)}(t) = \frac{1}{\sqrt{n}}(N(n t) - nt)$ converges to Brownian Motion, one might want to use the fact that any limit point of $N^{(n)}$ is continuous, which is immediate from the fact that the jumps of $N^{(n)}$ have size at most $\frac{1}{\sqrt{n}}$.
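Here is a hedged numerical sketch of the Poisson example (sample sizes and seed are arbitrary choices): the jumps of $N^{(n)}$ have size exactly $\frac{1}{\sqrt{n}}$, and at $t=1$ the CLT for $\mathrm{Poisson}(n)$ gives $N^{(n)}(1) = \frac{1}{\sqrt{n}}(N(n)-n) \Longrightarrow N(0,1)$.

```python
import numpy as np

# Sketch of the rescaled Poisson process at t = 1:
# N^(n)(1) = (N(n) - n) / sqrt(n), where N(n) ~ Poisson(n).
# Its mean and variance should be close to 0 and 1, matching the
# Brownian limit, while the jump size 1/sqrt(n) vanishes.
rng = np.random.default_rng(2)          # arbitrary seed
n, trials = 10_000, 50_000              # arbitrary sizes
Nn1 = (rng.poisson(n, size=trials) - n) / np.sqrt(n)
print(round(Nn1.mean(), 2), round(Nn1.var(), 2))  # close to 0 and 1
print(1 / np.sqrt(n))                   # jump size of N^(n): 0.01
```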
$\textbf{Example: Integral of the maximum of a random walk.}$
For a concrete example that would be quite hard (maybe not possible?) to do using finite dimensional distributions, let $\{X_i\}_{i \geq 0}$ be i.i.d. with $E X_i = 0$, $E X_i^2 = 1$, and set $S_n = \sum_{i=0}^n X_i$. Donsker's theorem on the Skorokhod space says that for $S^{(n)}(t) = \frac{1}{\sqrt{n}} S_{\lfloor nt \rfloor}$, $(t \mapsto S^{(n)}(t) ) \Longrightarrow \left( t \mapsto B(t)\right)$ where $B(t)$ is Brownian motion. Now consider the continuous map $g : \mathcal{D}[0,1] \to \mathcal{D}[0,1]$ given by
\begin{align*} x \mapsto \left(t \mapsto \int_0^t \max_{0 \leq r \leq s} x(r) ds\right) \end{align*}
Applying Donsker's theorem and the continuous mapping theorem, we see that
\begin{align*} \left( t \mapsto \int_0^t \max_{0 \leq r \leq s} \frac{1}{\sqrt{n}} \sum_{i=0}^{\lfloor n r \rfloor} X_i d s\right) = g(S^{(n)}) \Longrightarrow g(B) = \left( t \mapsto \int_0^t \max_{0 \leq r \leq s} B(r) ds \right) \end{align*}
In particular, since evaluation at the endpoint $t=1$ is continuous on $\mathcal{D}[0,1]$ (time changes fix the endpoints), we have \begin{align*} \int_0^1 \max_{0 \leq r \leq s} \frac{1}{\sqrt{n}} \sum_{i=0}^{\lfloor n r \rfloor} X_i \, d s \Longrightarrow \int_0^1 \max_{0 \leq r \leq s}B(r) \, ds \end{align*}
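This limit can also be checked by simulation. A hedged Monte Carlo sketch (Rademacher steps, arbitrary sample sizes and seed): since $E[\max_{0 \leq r \leq s} B(r)] = \sqrt{2s/\pi}$, Fubini gives $E\left[\int_0^1 \max_{0 \leq r \leq s} B(r)\, ds\right] = \frac{2}{3}\sqrt{2/\pi} \approx 0.532$, and the random-walk approximation should land near this value.

```python
import numpy as np

# Monte Carlo for the integral of the running maximum of the rescaled
# random walk, compared against the Brownian value (2/3) * sqrt(2/pi).
rng = np.random.default_rng(3)          # arbitrary seed
n, trials = 1000, 5000                  # arbitrary sizes
xi = rng.choice([-1.0, 1.0], size=(trials, n))
paths = np.cumsum(xi, axis=1) / np.sqrt(n)               # S^(n) at k/n
run_max = np.maximum(np.maximum.accumulate(paths, axis=1), 0.0)  # running max, incl. t = 0
integrals = run_max.mean(axis=1)        # Riemann sum of the running max over [0,1]
print(round(integrals.mean(), 2), round((2 / 3) * np.sqrt(2 / np.pi), 2))
```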