I am trying to understand the interpretation of Gradient Descent provided in Ryan Tibshirani's 2019 Convex Optimization course scribed notes:
http://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/05-grad-descent-scribed.pdf
since, for me, they are the clearest explanation of the topic I could find.
While I can follow how the function can be approximated at point y as:
$$f(y) ≈ f(x) + ∇f(x)^T (y − x) + \frac{1}{2}(y − x)^T ∇^2f(θ(x − y) + y)(y − x)$$
which, by replacing $∇^2f(θ(x − y) + y)$ with $\frac{1}{t}I$, can be rewritten as:
$$f(y) ≈ f(x) + ∇f(x)^T(y − x) + \frac{1}{2t}||y − x||_2^2 = g(y)$$
I have trouble following the first iff statement given here:
$$∇g(y) = 0 ⇔ ∇f(x)+ \frac{1}{t}(y −x) = 0 ⇔ y = x−t∇f(x)$$
This, if I understand correctly, derives the GD update rule by setting the gradient of $g(y)$ to $0$, which gives $∇f(x)+ \frac{1}{t}(y −x) = 0$.
My guess is that the author does not provide a step-by-step explanation of this passage because it is considered trivial; others I have consulted give the same reasoning in a slightly different form (see, for example, slide 19 here: https://www.cs.huji.ac.il/~shais/Lectures2014/lecture6.pdf, from Shai Shalev-Shwartz's lecture notes, adapted from page 185 of the 2014 book "Understanding Machine Learning: From Theory to Algorithms" by the same author with Shai Ben-David).
But I would nevertheless really appreciate it if someone could break this specific passage down mathematically and point to the rules behind it.
Start from the definition of $g$:
$$g(y)=f(x) + \nabla f(x)^T(y-x) + \frac1{2t}\|y-x\|_2^2$$
Differentiating with respect to $y$,
$$\nabla g(y) = \nabla f(x) + \frac1{t}(y-x)$$
Hence $$\nabla g(y)=0 \iff \nabla f(x) + \frac1t (y-x)=0$$
which is equivalent to $$\frac1t(y-x) = -\nabla f(x),$$
i.e.
$$y=x-t\nabla f(x)$$
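As a quick numerical sanity check of this derivation, the closed-form minimizer $y = x - t\nabla f(x)$ can be compared against $\nabla g$ directly. The quadratic choice of $f$, the step size $t$, and the random seed below are arbitrary illustrative values, not anything from the notes:

```python
import numpy as np

# Arbitrary concrete example: f(z) = z^T A z / 2 + b^T z with A symmetric
# positive definite, and an arbitrary step size t.
rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
A = A @ A.T + n * np.eye(n)          # symmetric positive definite
b = rng.normal(size=n)
t = 0.1

f = lambda z: 0.5 * z @ A @ z + b @ z
grad_f = lambda z: A @ z + b         # ∇f for this quadratic (A symmetric)

x = rng.normal(size=n)

# g(y) = f(x) + ∇f(x)^T (y - x) + ||y - x||^2 / (2t)
g = lambda y: f(x) + grad_f(x) @ (y - x) + (y - x) @ (y - x) / (2 * t)

# Candidate minimizer from the derivation: y* = x - t ∇f(x)
y_star = x - t * grad_f(x)

# ∇g(y*) = ∇f(x) + (y* - x)/t should vanish identically
grad_g_at_y_star = grad_f(x) + (y_star - x) / t
assert np.allclose(grad_g_at_y_star, 0)

# Since g is strictly convex in y, y* should beat nearby points
for _ in range(100):
    y = y_star + 0.1 * rng.normal(size=n)
    assert g(y_star) <= g(y)
```

The two assertions confirm both directions of the iff: the gradient of $g$ vanishes exactly at $x - t\nabla f(x)$, and that point is indeed the minimizer of the approximation $g$.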
Note that we have used
$$\nabla_y (a^Ty)=a$$
since $$\frac{\partial }{\partial y_i}\sum_{j=1}^n a_jy_j = a_i$$
and
$$\nabla _y \|y\|^2 = \nabla_y (y^Ty)=2y$$
since $$\frac{\partial }{\partial y_i}\sum_{j=1}^n y_j^2 = 2y_i$$
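Both gradient rules can also be verified numerically with central finite differences; the vectors and helper function below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
a = rng.normal(size=n)
y0 = rng.normal(size=n)
eps = 1e-6

def num_grad(fn, y):
    """Central-difference approximation of the gradient of fn at y."""
    g = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = eps
        g[i] = (fn(y + e) - fn(y - e)) / (2 * eps)
    return g

# ∇_y (a^T y) = a
assert np.allclose(num_grad(lambda y: a @ y, y0), a, atol=1e-5)

# ∇_y ||y||^2 = ∇_y (y^T y) = 2y
assert np.allclose(num_grad(lambda y: y @ y, y0), 2 * y0, atol=1e-5)
```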