Let $f$ be a neural network and $f(X, P)$ its forward pass, with $X$ the input values and $P$ a list of weight matrices and bias vectors. Backpropagation computes $\frac{\partial \mathcal{L}}{\partial P}$, the partial derivative of a loss function $\mathcal{L}$ with respect to each learnable parameter in $P$. The parameter update on the $k$-th iteration is then $P^{k+1} = P^k - t^k \cdot \frac{\partial \mathcal{L}}{\partial P}\big|_{P^k}$.
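To make the update rule concrete, here is a minimal sketch (my own toy example, not from any source I follow): one gradient-descent step $P^{k+1} = P^k - t^k \cdot \frac{\partial \mathcal{L}}{\partial P}$ for a single weight matrix and bias, with the gradient of a squared-error loss computed by hand for a tiny linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # weight matrix (part of P)
b = np.zeros(2)               # bias vector (part of P)
x = rng.normal(size=3)        # input X
y = np.array([1.0, -1.0])     # target

t = 0.1                       # learning rate t^k

pred = W @ x + b              # forward pass f(X, P)
err = pred - y
loss = 0.5 * err @ err        # loss L = 0.5 ||f(X, P) - y||^2

# Backpropagation for this model: dL/dW = err x^T, dL/db = err.
dW = np.outer(err, x)
db = err

# Parameter update P^{k+1} = P^k - t^k * dL/dP.
W -= t * dW
b -= t * db
```

For a small enough $t$, one such step decreases the loss on this quadratic objective.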
The Armijo rule then tells how to select the learning rate $t^k$ on the $k$-th iteration. The materials I follow state it as:
Fix an initial step $s > 0$, parameter $\beta \in (0, 1)$, and $\sigma \in (0,1)$. We choose the step size $t^k = s\beta^{m_k}$, where $m_k$ is the first nonnegative integer for which $$f(x^k) - f(x^k + t^k d^k) \geq - \sigma t^k \nabla f(x^k)^T d^k$$ where $d^k$ is such that $\nabla f(x^k)^T d^k < 0$.
Question: If $x^k$ in the above definition corresponds to the parameters $P^k$, and $d^k$ is the gradient, i.e. the direction of steepest descent, then is $\nabla f(x^k)^T d^k$ just the flattened $\frac{\partial \mathcal{L}}{\partial P}\big|_{P^k}$ dotted with itself? But if that is the case, how can $\nabla f(x^k)^T d^k$ be less than zero?