Let $f$ be a neural network and $f(X, P)$ its forward pass, with $X$ the input values and $P$ a list of weight matrices and bias vectors. Backpropagation computes $\frac{\partial \mathcal{L}}{\partial P}$, the partial derivative of a loss function $\mathcal{L}$ with respect to each learnable parameter in $P$. The parameter update on the $k$-th iteration is then $P^{k+1} = P^k - t^k \cdot \frac{\partial \mathcal{L}}{\partial P}\big|_{P^k}$.
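To make the update rule concrete, here is a minimal sketch (my own toy example, not from any source I follow): one gradient-descent step $P^{k+1} = P^k - t^k \cdot \frac{\partial \mathcal{L}}{\partial P}$ for a single weight matrix and bias, with the gradient of a squared-error loss computed by hand for a tiny linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # weight matrix (part of P)
b = np.zeros(2)               # bias vector (part of P)
x = rng.normal(size=3)        # input X
y = np.array([1.0, -1.0])     # target

t = 0.1                       # learning rate t^k

pred = W @ x + b              # forward pass f(X, P)
err = pred - y
loss = 0.5 * err @ err        # loss L = 0.5 ||f(X, P) - y||^2

# Backpropagation for this model: dL/dW = err x^T, dL/db = err.
dW = np.outer(err, x)
db = err

# Parameter update P^{k+1} = P^k - t^k * dL/dP.
W -= t * dW
b -= t * db
```

For a small enough $t$, one such step decreases the loss on this quadratic objective.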
The Armijo rule then tells how to select the learning rate $t^k$ on the $k$-th iteration. The materials I follow state it as:
Fix an initial step $s > 0$, parameter $\beta \in (0, 1)$, and $\sigma \in (0,1)$. We choose the step size $t^k = s\beta^{m_k}$, where $m_k$ is the first nonnegative integer for which $$f(x^k) - f(x^k + t^k d^k) \geq - \sigma t^k \nabla f(x^k)^T d^k$$ where $d^k$ is such that $\nabla f(x^k)^T d^k < 0$.
Question: If $x^k$ in the above definition corresponds to the parameters $P^k$, and $d^k$ is the gradient, i.e. the direction of steepest descent, then is $\nabla f(x^k)^T d^k$ just the flattened $\frac{\partial \mathcal{L}}{\partial P}\big|_{P^k}$ dotted with itself? But if that is the case, how can $\nabla f(x^k)^T d^k$ be less than zero?