In multivariable calculus, the gradient of a function $f(x,y)$ at a given point is the covariant vector:
$$\vec\nabla f=\begin{bmatrix}\frac{\partial f}{\partial x} & \frac{\partial f}{\partial y}\end{bmatrix}$$
while the corresponding contravariant vector of differentials is
$$\mathrm d\vec x=[\mathrm dx, \mathrm dy]$$
Their dot product results in the total differential of the function at that given point:
$$\mathrm df=\frac{\partial f}{\partial x}\mathrm dx+\frac{\partial f}{\partial y}\mathrm dy.$$
So far, so good. Except that this last calculation is supposed to yield a scalar value according to Reflections on Relativity by Kevin Brown (*):
"dy [read df for notational consistency] equals the scalar (dot) product of these two vectors... The scalar quantity dy is called the total differential of y."
At the same time, this last calculation is also referred to as a 1-form, a (0,1)-tensor, or a covector (more precisely, an element of the cotangent space). This enables the calculation of the directional derivative at the point under consideration, in the direction of a vector $\mathbf {\vec v}$ in the tangent space, as the dot product:
$$\langle \mathrm df , \mathbf{\vec v}\rangle\to\mathbb R.$$
The question:
Since $\mathrm df$ cannot be a scalar and a covector at the same time, which one is it, and what's wrong in the above definitions that explains this apparent contradiction?
(*) Please note that this seems to be a sleight of hand (or possibly an error) in the quoted source, in light of the accepted answer and its associated comment: "Mathematically, total differentials are covectors (or covector fields). You may apply them to vectors (or vector fields) in order to obtain scalars (or scalar fields). But they themselves should not be thought as scalars." In the accepted answer the total differential is referred to as the "differential operator at $p$ of $f$."
Short answer:
Technically, $\mathrm{d}f$ is a covector, not a scalar.
Long, detailed answer:
First, some linear algebra. Take any real $n$-dimensional vector space $V$ and pick an (ordered) basis $(e_i)$. Then you may pair all vectors in $V$ with $n$-tuples of real numbers so that $v^1 e_1 + \cdots + v^n e_n \leftrightarrow (v^1, \dots, v^n)$. For vectors, the convention is that these $n$-tuples are written vertically – i.e. a second isomorphism is established between $\mathbb R^n$ and the space of column vectors ($n\times 1$ matrices).
Now define the dual of $V$ as the vector space $V^*$ of linear functionals on $V$, i.e., linear maps $\varphi : V \to \mathbb R$. Our choice of basis $(e_i)$ on $V$ induces a natural choice of basis on $V^*$, the dual basis $(\varepsilon^j)$ such that $\varepsilon^j(e_i) = \delta^j_i$ (the Kronecker delta: $1$ when $i=j$, $0$ otherwise). This way, we may also pair every linear functional $\varphi \in V^*$ with an $n$-tuple of real numbers $(\varphi_1,\dots, \varphi_n)$, and the convention now is that they are written horizontally, i.e. as $1 \times n$ matrices (row vectors).
Why the row-column vector convention? Because now the action of $\varphi \in V^*$ on $v \in V$ can be written $$\langle \varphi, v \rangle := \varphi(v) = \begin{pmatrix} \varphi_1 & \cdots & \varphi_n\end{pmatrix} \begin{pmatrix} v^1 \\ \vdots \\ v^n \end{pmatrix} = \varphi_1 v^1 + \cdots + \varphi_n v^n $$ so that applying a linear functional to a vector is the same as matrix-multiplying the row vector on the left by the column vector on the right, yielding a scalar, as you would expect. (The map $\langle\cdot,\cdot\rangle : V^* \times V \to \mathbb R$ is called the duality product and it is very close to the notion of dot product.) It is worth noting that linear functionals are sometimes called $1$-covariant tensors, $(0,1)$-tensors over $V$, or covectors, and vectors in $V$ are called $1$-contravariant tensors or $(1,0)$-tensors over $V$.
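As a minimal numerical sketch of the duality product (the components below are illustrative, not from the discussion above): a covector is stored as a row of coefficients, a vector as a column of components, and the pairing is exactly the $1\times n$ by $n\times 1$ matrix product.

```python
# A covector as a row of coefficients, a vector as a column of components;
# the duality product <phi, v> is the (1 x n)(n x 1) matrix product.
phi = [3.0, -1.0, 2.0]   # covector components (phi_1, ..., phi_n)
v   = [1.0,  4.0, 5.0]   # vector components (v^1, ..., v^n)

# <phi, v> = phi_1 v^1 + ... + phi_n v^n
pairing = sum(p * c for p, c in zip(phi, v))
print(pairing)  # 3*1 - 1*4 + 2*5 = 9.0
```

Note that no inner product on $V$ is used here: the pairing only needs the covector's coefficients and the vector's components in dual bases.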
Back to geometry/analysis. Call $\Omega$ an open subset of $n$-dimensional Euclidean space. Remember that the "partial derivative in the $k$-th direction evaluated at a point $p=(p^1, \dots, p^n) \in \Omega$" operation, $\frac{\partial}{\partial x^k}\Big|_{p=(p^1,\dots,p^n)}$, is linear, that is, if $a,b \in \mathbb R$ and $f,g : \Omega \to \mathbb R$ are smooth functions, $$\begin{split} \frac{\partial}{\partial x^k}\Bigg|_{p} (af + bg) &= \frac{\partial (af + bg)}{\partial x^k}(p) \\ &=a \frac{\partial f}{\partial x^k}(p) + b \frac{\partial g}{\partial x^k}(p) \\ &= a \frac{\partial}{\partial x^k}\Bigg|_{p} f + b \frac{\partial}{\partial x^k}\Bigg|_{p} g \end{split}$$ so it makes sense to consider the vector space of derivations at $p$, that is, the vector space $T_p\Omega$ (also called "tangent space at $p$") of all linear combinations of partial differential operators $$\sum_{k=1}^n a^k \frac{\partial}{\partial x^k}\Bigg|_{p} = a^1 \frac{\partial}{\partial x^1}\Bigg|_{p} + \cdots + a^n\frac{\partial}{\partial x^n}\Bigg|_{p} $$ It is clear that $\left(\frac{\partial}{\partial x^k}\Big|_{p}\right)$ constitutes an ordered basis of $T_p\Omega$, which is then $n$-dimensional. Of course, you may apply the column-vector convention and identify the above linear combination with the vertical list $\begin{pmatrix} a^1 \\ \vdots \\ a^n \end{pmatrix}$.
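To make the "tangent vectors are derivations" picture concrete, here is a sketch in which a tangent vector at $p$ is literally a linear combination of partial-derivative operators acting on smooth functions; the partials are approximated by central differences, and the function $f(x,y)=xy$ and point $p=(2,3)$ are illustrative choices, not from the text.

```python
# A tangent vector at p as a derivation: a linear combination of
# partial-derivative operators, applied here via central differences.
def partial(f, p, k, h=1e-6):
    q_plus, q_minus = list(p), list(p)
    q_plus[k] += h
    q_minus[k] -= h
    return (f(q_plus) - f(q_minus)) / (2 * h)

def derivation(coeffs, p):
    # a^1 d/dx^1|_p + ... + a^n d/dx^n|_p, acting on smooth functions
    return lambda f: sum(a * partial(f, p, k) for k, a in enumerate(coeffs))

f = lambda q: q[0] * q[1]                  # f(x, y) = x*y
v = derivation([1.0, -2.0], [2.0, 3.0])    # v = d/dx - 2 d/dy at p = (2, 3)
print(round(v(f), 6))  # 1*3 - 2*2 = -1 (up to finite-difference error)
```

The point of returning a *function of $f$* is exactly the derivation viewpoint: the tangent vector is the operator, and its coefficients $(a^1,\dots,a^n)$ are its components in the basis $\partial/\partial x^k|_p$.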
What is then its dual space? It's the "cotangent space at $p$", the space $T^*_p\Omega$ of all covectors on $T_p\Omega$. The natural choice of basis on $T^*_p\Omega$ is the dual basis of covectors that yield the Kronecker delta when applied to the pure partial differential operators at $p$. These are the maps $$\mathrm{d}x^m|_p : T_p\Omega \to \mathbb R \qquad \text{s.t.} \qquad \mathrm{d} x^m|_p \left(\frac{\partial}{\partial x^k}\Bigg|_{p}\right) = \delta^m_k $$ so that any covector $\alpha|_p \in T_p^*\Omega$ can be written as $$\alpha|_p = \alpha_1 \mathrm{d}x^1|_p + \cdots + \alpha_n \mathrm{d}x^n|_p. $$
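The defining property of the dual basis can be sketched in a few lines: in the coordinate basis, $\mathrm{d}x^m|_p$ simply reads off the $m$-th component of a tangent vector, so applying it to the basis derivations produces the Kronecker delta (a three-dimensional example, for illustration).

```python
# The dual basis covector dx^m|_p reads off the m-th component of a
# tangent vector expressed in the coordinate basis d/dx^k|_p.
def dx(m):
    return lambda v: v[m]   # v = (a^1, ..., a^n)

# The coordinate basis derivations, as component tuples.
e = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# dx^m(e_k) assembled into a matrix: the Kronecker delta (identity).
delta = [[dx(m)(e[k]) for k in range(3)] for m in range(3)]
print(delta)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```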
Now, given a smooth function $f : \Omega \to \mathbb R$, the differential operator at $p$ of $f$ is the covector in $T_p^*\Omega$ whose coefficients $\alpha_m$ are exactly $ \alpha_m = \frac{\partial f}{\partial x^m}(p)$, so that $$\mathrm{d}f|_p = \frac{\partial f}{\partial x^1}(p)\mathop{} \mathrm{d}x^1|_p + \cdots + \frac{\partial f}{\partial x^n}(p) \mathop{}\mathrm{d}x^n|_p$$ This also makes sense of the notation $\mathrm{d}x^m|_p$: the basis elements are exactly the differentials of the coordinate functions $ x^m : p \mapsto p^m$. Of course, you may now apply the row-vector convention to get the row-gradient of $f$ $$\begin{pmatrix} \dfrac{\partial f}{\partial x^1}(p) & \cdots & \dfrac{\partial f}{\partial x^n}(p) \end{pmatrix} $$ The gradient of $f$ is the transpose of this vector, i.e. the column vector $\nabla f(p)$ you get by turning this row vector $90$ degrees clockwise. Thus, applying the row-gradient to a vector is the same thing as dotting the gradient with that same vector.
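A quick numerical sketch of the components of $\mathrm{d}f|_p$ (the function $f(x,y)=x^2y$ and the point $p=(1,2)$ are illustrative choices): the coefficients of the covector are just the partial derivatives at $p$, here approximated by central differences.

```python
# The components of df|_p are the partial derivatives of f at p,
# arranged as a row (the row-gradient); computed by central differences.
def partial(f, p, k, h=1e-6):
    qp, qm = list(p), list(p)
    qp[k] += h
    qm[k] -= h
    return (f(qp) - f(qm)) / (2 * h)

f = lambda q: q[0]**2 * q[1]   # f(x, y) = x^2 * y
p = [1.0, 2.0]

df_p = [partial(f, p, k) for k in range(2)]   # row-gradient of f at p
print([round(c, 6) for c in df_p])  # [4.0, 1.0], since df = 2xy dx + x^2 dy
```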
The directional derivative. Now you can guess how this plays out. Let $p \in \Omega$, $v|_p \in T_p\Omega$ and let $f$ be a smooth function over $\Omega$. Then the directional derivative of $f$ at $p$ in the direction $v|_p$ is just the scalar $$\begin{split} \langle \mathrm d f|_p, v|_p \rangle = \mathrm{d}f|_p (v|_p) &= \begin{pmatrix} \dfrac{\partial f}{\partial x^1}(p) & \cdots & \dfrac{\partial f}{\partial x^n}(p) \end{pmatrix} \begin{pmatrix} v^1|_p \\ \vdots \\ v^n|_p \end{pmatrix} = \nabla f(p) \cdot \begin{pmatrix} v^1|_p \\ \vdots \\ v^n|_p \end{pmatrix} \\ &= v^1|_p \dfrac{\partial f}{\partial x^1}(p) + \cdots + v^n|_p \dfrac{\partial f}{\partial x^n}(p) \end{split}$$ and we are done.
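The directional derivative can be checked numerically against the pairing (again with illustrative choices: $f(x,y)=x^2y$, $p=(1,2)$, $v=(3,-2)$): dotting the gradient with $v$ should agree with the difference quotient of $f$ along the line $p + tv$.

```python
# Directional derivative <df|_p, v|_p> as row-gradient times column vector,
# cross-checked against the limit definition along p + t v.
def f(x, y):
    return x**2 * y

p = (1.0, 2.0)
grad = [4.0, 1.0]   # (df/dx, df/dy) at p, computed by hand: (2xy, x^2)
v = [3.0, -2.0]     # a tangent vector at p

# <df|_p, v|_p> = sum of gradient components times vector components
ddir = sum(g * c for g, c in zip(grad, v))
print(ddir)  # 4*3 + 1*(-2) = 10.0

# Finite-difference check along t -> f(p + t v) at t = 0.
h = 1e-6
approx = (f(p[0] + h * v[0], p[1] + h * v[1]) - f(*p)) / h
print(abs(approx - ddir) < 1e-4)  # True
```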
Addendum. So far we've been working in the tangent space at a single point $p \in \Omega$. But nothing prevents us from defining maps that assign to each point $p \in \Omega$ a vector or a covector. These are called fields (although these should be, in a more sophisticated picture, sections of the tangent or cotangent bundle of $\Omega$). Covector fields are also called $1$-forms; for example, $\mathrm{d}f$ is a $1$-form (to each $p$ we assign $\mathrm{d} f|_p$). One may extend the operation of applying a covector to a vector to fields: if $v$ is the vector field such that to each $p$ is associated a vector $v|_p \in T_p\Omega$, and $\alpha$ is a covector field defined similarly, then $\langle\alpha,v\rangle$ is the scalar field (i.e., real function) that associates to each $p$ the real number $\langle \alpha|_p,v|_p\rangle = \alpha|_p(v|_p)$. If $\alpha = \mathrm d f$, you get an extension of the definition of directional derivative we gave above: just picture a field of vectors distributed over $\Omega$ and a function $f$ with its gradient, and imagine dotting the gradient at a point with the specific vector assigned at that point, and doing so for all points in $\Omega$. In some way, this gives an idea of "how well" the gradient field conforms to the given vector field.
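A last sketch for the field picture (the specific choices here, $f(x,y)=x^2+y^2$ and a rotational vector field, are illustrative): applying the $1$-form $\mathrm{d}f$ pointwise to a vector field yields a scalar field, and in this example it vanishes everywhere because the rotational field is tangent to the level circles of $f$.

```python
# A 1-form df applied pointwise to a vector field gives a scalar field.
# Here f(x, y) = x^2 + y^2, so df|_p has components (2x, 2y).
def df(p):
    x, y = p
    return [2 * x, 2 * y]

def vfield(p):
    x, y = p
    return [-y, x]   # a rotational vector field

def pairing_field(p):
    # the scalar field <df, v>: pair the covector and vector at each point
    return sum(a * b for a, b in zip(df(p), vfield(p)))

# df annihilates this field: rotation is tangent to the level sets of f,
# so the directional derivative of f along it is zero at every point.
print(pairing_field((3.0, 4.0)))  # 2*3*(-4) + 2*4*3 = 0.0
```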