From Wikipedia, we have:
Let $E \to M$ be a smooth vector bundle over a differentiable manifold $M$. Let $\Gamma(E)$ be the space of all smooth sections. A connection on $E$ is an $\mathbb{R}$-linear map $\nabla : \Gamma(E) \to \Gamma(E\otimes T^{*}M)$ such that $\nabla(\sigma f)=\nabla(\sigma)f+\sigma \otimes df$ for all smooth functions $f$ on $M$ and all sections $\sigma$.
I don't understand why connections on a vector bundle are defined this way. Intuitively, we need a connection so that we can parallel transport a vector along a curve, but I still don't see why the connection is defined in this way.
$\newcommand{\R}{\mathbb{R}}$ Here is an elaboration of @TedShifrin's comment.
The concept of a covariant derivative arises naturally, because it extends the definitions of the directional derivative and differential of a function to sections of a vector bundle. Recall that given a function $f: M \rightarrow \R$ and a tangent vector $v \in T_pM$, the directional derivative is defined to be $$ d_vf(p) = \left.\frac{d}{dt}\right|_{t=0} f(c(t)) \in \R, $$ where $c$ is a parameterized curve such that $c(0) = p$ and $c'(0) = v$. This has the following notable properties:
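Spelled out in their usual textbook form, the properties are as follows. Here $f, g$ are smooth functions near $p$, $v, w \in T_pM$ and $a, b \in \R$; the names $g$, $w$, $a$, $b$ are introduced only for this statement.

$$
\begin{aligned}
d_{av+bw}f(p) &= a\,d_vf(p) + b\,d_wf(p) && \text{(linear in } v\text{)} \\
d_v(af+bg)(p) &= a\,d_vf(p) + b\,d_vg(p) && \text{(linear in } f\text{)} \\
d_v(fg)(p) &= g(p)\,d_vf(p) + f(p)\,d_vg(p) && \text{(product rule)}
\end{aligned}
$$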
(Linearity in $f$ together with the product rule is the definition of a real-valued derivation at $p$, and every derivation at $p$ is the directional derivative along exactly one tangent vector $v \in T_pM$.)
When studying a vector bundle $E$ over $M$, it is natural to look for a way to differentiate a section of $E$, since sections generalize functions. So a natural question is whether one can define a directional derivative of a section with properties analogous to those of $d_vf$. Given a section $\sigma$ and a tangent vector $v \in T_pM$, we want to define $d_v\sigma(p) \in E_p$ such that it is linear in $v$, linear in $\sigma$, and satisfies a product rule for multiplication of $\sigma$ by a function.
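Written out, the analogues of the three properties of $d_vf$ would read as follows, in the usual textbook form. Here $\tau$ denotes a second section, $f$ a smooth function, $w \in T_pM$ and $a, b \in \R$; these names are introduced only for this statement.

$$
\begin{aligned}
d_{av+bw}\sigma(p) &= a\,d_v\sigma(p) + b\,d_w\sigma(p) && \text{(linear in } v\text{)} \\
d_v(a\sigma+b\tau)(p) &= a\,d_v\sigma(p) + b\,d_v\tau(p) && \text{(linear in } \sigma\text{)} \\
d_v(f\sigma)(p) &= \bigl(d_vf(p)\bigr)\,\sigma(p) + f(p)\,d_v\sigma(p) && \text{(product rule)}
\end{aligned}
$$

The last line is the pointwise form of the condition $\nabla(\sigma f)=\nabla(\sigma)f+\sigma \otimes df$ from the question.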
If you write everything locally, in terms of local coordinates on $M$ and a trivialization of $E$, you discover that, unlike the directional derivative of a function, the directional derivative of a section has no unique definition.
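To see where the freedom comes from, here is a sketch of the local computation. It assumes a local frame $e_1, \dots, e_k$ of $E$ over an open set $U$; the symbols $e_a$, $\sigma^a$ and $A^a_b$ are notation introduced here, with summation over repeated indices. Writing $\sigma = \sigma^a e_a$, linearity and the product rule give

$$ d_v\sigma(p) = \bigl(d_v\sigma^a(p)\bigr)\,e_a(p) + \sigma^a(p)\,d_ve_a(p). $$

The first term is the ordinary directional derivative of the component functions. The second involves $d_ve_b(p)$, which the properties only require to be some element of $E_p$ depending linearly on $v$, say $d_ve_b(p) = A^a_b(v)\,e_a(p)$. Hence

$$ d_v\sigma(p) = \bigl(d_v\sigma^a(p) + A^a_b(v)\,\sigma^b(p)\bigr)\,e_a(p). $$

Any choice of smooth 1-forms $A^a_b$ on $U$ gives a valid derivative. Taking $A = 0$ in this frame is one choice, but a different frame would single out a different "$A = 0$" derivative, so none of them is canonical.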
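This leads to the concept of a connection, which is simply one possible choice of directional derivative for sections. In the notation of the question, $(\nabla\sigma)(v) = d_v\sigma(p)$, and the condition $\nabla(\sigma f)=\nabla(\sigma)f+\sigma \otimes df$ is the product rule for this derivative.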
For me, parallel transport then arises as the natural generalization of a constant function. Again using local coordinates and a trivialization, you can see that constant sections (those whose directional derivatives all vanish) generally do not exist on an open set. Along a curve, however, one can always find a section with zero derivative in the tangential direction, by solving an ODE. This is why any connection defines parallel transport.
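As a sketch of that ODE, assume a local frame $e_a$ of $E$ and 1-forms $A^a_b$ (both introduced here only for illustration) in which the derivative takes the form $d_v(s^a e_a) = \bigl(d_vs^a + A^a_b(v)\,s^b\bigr)e_a$. A section along a curve $c$ can be written $s(t) = s^a(t)\,e_a(c(t))$, and the condition that its derivative in the direction $c'(t)$ vanishes reads

$$ \frac{ds^a}{dt}(t) + A^a_b\bigl(c'(t)\bigr)\,s^b(t) = 0. $$

This is a linear first-order system, so for each initial value $s(0) \in E_{c(0)}$ there is a unique solution for as long as the curve stays in the domain of the frame, and $s(0) \mapsto s(t)$ is a linear isomorphism $E_{c(0)} \to E_{c(t)}$. By contrast, asking for $d_v\sigma = 0$ for all $v$ on an open set is a system of PDEs, one equation per coordinate direction, which is overdetermined and in general has no solution with a prescribed nonzero value at a point.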