Once we define a connection $\nabla$, the covariant derivative $\nabla_b X_a$ follows naturally, as is well known. The next step is to consider parallel-transported vector fields. Take a curve $\gamma: [a,b] \longrightarrow \mathcal{M}$ with tangent vector $\mathbf{X}(t)$ and a vector field $\mathbf{V}(t)$ along it. We say $\mathbf{V}$ is parallel transported along $\gamma$ if $\nabla_{\mathbf{X}(t)}\mathbf{V}(t) = 0 \ \ \forall t \in [a,b]$. We can then define a map $\mathbf{P}_\gamma: T_{\gamma(a)}\mathcal{M} \rightarrow T_{\gamma(b)}\mathcal{M}$ that sends the vector $\mathbf{V}(a)$ to the vector $\mathbf{V}(b)$, where $\mathbf{V}$ is the parallel-transported field with initial value $\mathbf{V}(a)$. This map gives the notion of parallel transport of a vector.
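
To make the condition concrete, here is how I understand it in local coordinates. This is only a sketch: the Christoffel symbols $\Gamma^a_{bc}$ of $\nabla$ and the components $X^b = d\gamma^b/dt$ of the tangent vector of $\gamma$ are notation I am assuming, not something introduced above.

$$
\left(\nabla_{\mathbf{X}(t)}\mathbf{V}(t)\right)^a = \frac{dV^a}{dt} + \Gamma^a_{bc}\big(\gamma(t)\big)\, X^b(t)\, V^c(t) = 0 .
$$

This is a linear first-order ODE for the components $V^a(t)$, so a choice of $\mathbf{V}(a)$ fixes $\mathbf{V}(t)$ along the whole curve, which is why the map $\mathbf{P}_\gamma$ is well defined.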
What I don't understand is why most books say that $\nabla_{\mathbf{X}}\mathbf{Y}$ is the parallel transport of $\mathbf{Y}$ along the curve $\gamma$ whose tangent vector is $\mathbf{X}$, when by definition a parallel-transported $\mathbf{Y}$ has covariant derivative $0$ along $\mathbf{X}$. I hope the question is clear; if it isn't, I'm happy to clarify.