The gradient is usually written as a sum, over the coordinates, of each unit vector times the derivative with respect to that coordinate. In Einstein summation convention:
$\hat e_i \partial_i$
I've seen it written this way in some places.
Is this wrong? Is one of them supposed to be a contravariant vector, because otherwise it won't transform as a tensor between coordinate systems?
I would say that "this definition depends on the orthonormality of $(\hat{e}_i)$", rather than that it is "wrong", but that's it. More generally, if $\{e_1 \ldots e_n\}$ is a basis of $\mathbb{R}^n$, $x^1 \ldots x^n$ are the coordinates with respect to it, and $\{e^1\ldots e^n\}$ is the dual basis (that is, the unique set of vectors with the property $e^i \cdot e_j = \delta^i{ }_j$), then
$$\nabla f =\frac{\partial f}{\partial x^i}e^i.$$
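To see why the dual basis matters, here is a small worked example. The specific basis and function are my own choice for illustration, not taken from the book. In $\mathbb{R}^2$ with Cartesian coordinates $(u,v)$, take the non-orthonormal basis $e_1=(1,0)$, $e_2=(1,1)$. Its dual basis is $e^1=(1,-1)$, $e^2=(0,1)$, as you can check from $e^i \cdot e_j = \delta^i{ }_j$. A point $x^1 e_1 + x^2 e_2$ has Cartesian coordinates $u = x^1 + x^2$, $v = x^2$. For $f = u = x^1 + x^2$ the Cartesian gradient is $(1,0)$, and the formula gives
$$\nabla f = \frac{\partial f}{\partial x^1}e^1 + \frac{\partial f}{\partial x^2}e^2 = (1,-1) + (0,1) = (1,0),$$
whereas using $e_i$ in place of $e^i$ would give
$$\frac{\partial f}{\partial x^1}e_1 + \frac{\partial f}{\partial x^2}e_2 = (1,0) + (1,1) = (2,1) \neq \nabla f.$$
When the basis is orthonormal, $e^i = e_i$, and you recover the expression $\hat e_i \partial_i$ from the question.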
P.S.: You can read more in Itskov's book Tensor Algebra and Tensor Analysis for Engineers: http://books.google.it/books?id=8FVk_KRY7zwC&lpg=PP1&hl=it&pg=PA42#v=onepage&q&f=false