The gradient of a function ##f:\mathbb R^n\to\mathbb R## is the function ##\nabla f:\mathbb R^n\to\mathbb R^n## defined by
$$\nabla f(x)=(f_{,1}(x),\dots,f_{,n}(x)),$$ for all ##x\in\mathbb R^n##. For each ##i\in\{1,\dots,n\}##, ##f_{,i}## denotes the ##i##th partial derivative of ##f##. In differential geometry, partial derivatives are defined in terms of a coordinate system together with these conventional partial derivatives. For example, if ##x:U\to\mathbb R^n## is a coordinate system on ##U\subseteq\mathbb R^n##, and ##p\in U##, then for all ##i\in\{1,\dots,n\}##, we have
$$\frac{\partial}{\partial x^i}\bigg|_p f= (f\circ x^{-1})_{,i}(x(p)).$$ This statement defines the notation on the left.
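As a rough numerical illustration of this definition, the sketch below estimates ##(f\circ x^{-1})_{,i}(x(p))## by central differences. The choice of polar coordinates on the right half-plane for ##x##, the function ##f(u,v)=u^2+v^2##, the point ##p=(3,4)## and the helper names `partial` and `coord_partial` are all assumptions made for this example; they are not part of the discussion above.

```python
import math

def partial(g, i, a, h=1e-6):
    """Central-difference estimate of the conventional partial derivative g_{,i}(a)."""
    up, dn = list(a), list(a)
    up[i] += h
    dn[i] -= h
    return (g(up) - g(dn)) / (2 * h)

# Assumed coordinate system x on U = {(u, v): u > 0}: polar coordinates (r, theta).
def x(p):
    return [math.hypot(p[0], p[1]), math.atan2(p[1], p[0])]

def x_inv(c):
    return [c[0] * math.cos(c[1]), c[0] * math.sin(c[1])]

# Assumed test function f(u, v) = u^2 + v^2, so (f o x^{-1})(r, theta) = r^2.
def f(p):
    return p[0] ** 2 + p[1] ** 2

def coord_partial(f, chart, chart_inv, i, p):
    """d/dx^i at p applied to f, i.e. (f o x^{-1})_{,i}(x(p))."""
    return partial(lambda c: f(chart_inv(c)), i, chart(p))

p = [3.0, 4.0]                               # here r = 5
d_r = coord_partial(f, x, x_inv, 0, p)       # analytically 2r = 10
d_theta = coord_partial(f, x, x_inv, 1, p)   # analytically 0
print(d_r, d_theta)
```

Note that the derivative is taken of ##f\circ x^{-1}##, a function on a subset of ##\mathbb R^n##, and only then evaluated at the n-tuple ##x(p)##.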
The conventional partial derivatives in a gradient can be interpreted as partial derivatives in the sense of differential geometry, if we use the fact that the identity map ##I##, defined by ##I(x)=x## for all ##x\in\mathbb R^n##, is a coordinate system. We have
$$\frac{\partial}{\partial I^i}\bigg|_p f = (f\circ I^{-1})_{,i}(I(p)) = f_{,i}(p).$$ To see how partial derivatives in the sense of differential geometry transform under a change of coordinates ##x\to y##, we need to use the chain rule:
\begin{align}
\frac{\partial}{\partial y^i}\bigg|_p f &=(f\circ y^{-1})_{,i}(y(p)) = (f\circ x^{-1}\circ x\circ y^{-1})_{,i}(y(p))= (f\circ x^{-1})_{,j} \big((x\circ y^{-1})(y(p))\big)\, (x\circ y^{-1})^j{}_{,i}(y(p))\\
& = (x\circ y^{-1})^j{}_{,i}(y(p)) \frac{\partial}{\partial x^j}\bigg|_p f.
\end{align} Is the transformation
$$\frac{\partial}{\partial x^i}\bigg|_p \to \frac{\partial}{\partial y^i}\bigg|_p =(x\circ y^{-1})^j{}_{,i}(y(p)) \frac{\partial}{\partial x^j}\bigg|_p$$ covariant or contravariant? Well, "covariant" means that the components transform the same way as the basis vectors, and the partial derivative functionals ##\frac{\partial}{\partial x^i}\big|_p## are the basis vectors (of the tangent space at ##p##) associated with the coordinate system ##x##. So the transformation is by definition covariant.
I guess this changes the question to why the coordinate n-tuple ##(x^1(p),\dots,x^n(p))## that a coordinate system ##x## associates with a point ##p\in\mathbb R^n## transforms contravariantly. It doesn't always. Under the coordinate change ##x\to y##, ##x(p)## changes to
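The transformation rule for the partial derivatives can be checked numerically. The following is only a sketch under assumed choices: ##x## is taken to be the identity map ##I##, ##y## is taken to be polar coordinates on the right half-plane, and the function ##f##, the point ##p## and all helper names are made up for the example. It compares ##\frac{\partial}{\partial y^i}\big|_p f## with ##(x\circ y^{-1})^j{}_{,i}(y(p))\,\frac{\partial}{\partial x^j}\big|_p f##, with every derivative estimated by central differences.

```python
import math

def partial(g, i, a, h=1e-6):
    """Central-difference estimate of the conventional partial derivative g_{,i}(a)."""
    up, dn = list(a), list(a)
    up[i] += h
    dn[i] -= h
    return (g(up) - g(dn)) / (2 * h)

# Assumed coordinate systems: x = identity map I, y = polar coordinates (u > 0).
def x(p):
    return list(p)

def x_inv(c):
    return list(c)

def y(p):
    return [math.hypot(p[0], p[1]), math.atan2(p[1], p[0])]

def y_inv(c):
    return [c[0] * math.cos(c[1]), c[0] * math.sin(c[1])]

# Assumed test function.
def f(p):
    return p[0] ** 2 * p[1] + math.sin(p[1])

def coord_partial(f, chart, chart_inv, i, p):
    """(f o chart^{-1})_{,i}(chart(p))."""
    return partial(lambda c: f(chart_inv(c)), i, chart(p))

def x_of_y_inv(c):
    """The transition map x o y^{-1}."""
    return x(y_inv(c))

n = 2
p = [1.0, 2.0]

# Left-hand side: d/dy^i at p applied to f.
lhs = [coord_partial(f, y, y_inv, i, p) for i in range(n)]

# Right-hand side: (x o y^{-1})^j_{,i}(y(p)) * d/dx^j at p applied to f, summed over j.
rhs = [
    sum(
        partial(lambda c, j=j: x_of_y_inv(c)[j], i, y(p))
        * coord_partial(f, x, x_inv, j, p)
        for j in range(n)
    )
    for i in range(n)
]
print(lhs, rhs)
```

The two lists should agree up to finite-difference error, which is all this check can establish; it is not a substitute for the chain-rule argument.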
$$y(p)=(y\circ x^{-1}\circ x)(p)= (y\circ x^{-1})(x(p)).$$ To proceed from here, an assumption is necessary. We assume that ##y\circ x^{-1}## is a linear bijection from ##\mathbb R^n## to ##\mathbb R^n## (for example a rotation or a Lorentz transformation). The ##i##th component of the matrix equation corresponding to the above (see https://www.physicsforums.com/showthread.php?t=694922 if you don't understand that concept) is
$$(y(p))^i = (y\circ x^{-1})^i{}_j (x(p))^j.$$ Let T be an arbitrary linear bijection from ##\mathbb R^n## to ##\mathbb R^n##. For all ##x\in\mathbb R^n## (apologies for using the symbol x for a second purpose), we have
\begin{align}
&T^i(x)=T^i{}_j x^j\\
&T^i{}_{,k}(x)=T^i{}_j \delta^j_k =T^i{}_k.
\end{align} This implies that ##(y\circ x^{-1})^i{}_{,j}(x(p)) =(y\circ x^{-1})^i{}_j##. So we have
$$y^i(p)=(y(p))^i =(y\circ x^{-1})^i{}_j (x(p))^j = (y\circ x^{-1})^i{}_{,j}(x(p))\, x^j(p).$$ As you can see, the numbers ##(y\circ x^{-1})^i{}_{,j}(x(p))## that appear in this transformation equation are not the same as the numbers ##(x\circ y^{-1})^j{}_{,i}(y(p))## that appear in the transformation equation for the components of the gradient. However, we have
\begin{align}
\delta^j_k &=I^j{}_{,k}(x(p))= (x\circ y^{-1}\circ y\circ x^{-1})^j{}_{,k}(x(p)) =(x\circ y^{-1})^j{}_{,i}(y(p)) (y\circ x^{-1})^i{}_{,k}(x(p)).
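\end{align} So the matrix with components ##(x\circ y^{-1})^j{}_{,i}(y(p))## is the inverse of the matrix with components ##(y\circ x^{-1})^i{}_{,k}(x(p))##. This is how we see that, when ##y\circ x^{-1}## is linear, coordinate n-tuples transform contravariantly, i.e. using the inverse of the matrix that's used to transform the basis vectors.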
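Here is a small numerical sanity check of the last two equations, assuming a specific linear coordinate change. The identity map for ##x##, the matrix ##\begin{pmatrix}2&1\\0&1\end{pmatrix}## defining ##y##, the point ##p## and the helper `jacobian` are hypothetical choices for illustration only.

```python
def jacobian(g, a, h=1e-6):
    """Matrix J with J[i][j] approximating g^i_{,j}(a), by central differences."""
    n = len(a)
    m = len(g(a))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up, dn = list(a), list(a)
        up[j] += h
        dn[j] -= h
        gu, gd = g(up), g(dn)
        for i in range(m):
            J[i][j] = (gu[i] - gd[i]) / (2 * h)
    return J

# Assumed coordinate systems: x = identity, y = a linear bijection (not a rotation,
# so that its inverse differs from its transpose).
def x(p):
    return list(p)

def x_inv(c):
    return list(c)

def y(p):
    return [2.0 * p[0] + p[1], p[1]]

def y_inv(c):
    return [0.5 * (c[0] - c[1]), c[1]]

p = [3.0, 4.0]

# (y o x^{-1})^i_{,j}(x(p)): the matrix that transforms coordinate n-tuples.
J_yx = jacobian(lambda c: y(x_inv(c)), x(p))
# (x o y^{-1})^j_{,i}(y(p)): the matrix that transforms the basis vectors.
J_xy = jacobian(lambda c: x(y_inv(c)), y(p))

# The product should be close to the identity matrix (the delta^j_k equation).
product = [
    [sum(J_xy[j][i] * J_yx[i][k] for i in range(2)) for k in range(2)]
    for j in range(2)
]

# y^i(p) should equal (y o x^{-1})^i_{,j}(x(p)) x^j(p), since y o x^{-1} is linear.
y_from_matrix = [sum(J_yx[i][j] * x(p)[j] for j in range(2)) for i in range(2)]
print(product, y_from_matrix, y(p))
```

If ##y## is replaced by a nonlinear map such as polar coordinates, the product of the two matrices is still expected to be the identity, but the second check generally fails, which matches the remark that coordinate n-tuples don't always transform contravariantly.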