I am trying to understand the interpretation of Gradient Descent provided in Ryan Tibshirani's 2019 Convex Optimization course scribed notes:
http://www.stat.cmu.edu/~ryantibs/convexopt-S15/scribes/05-grad-descent-scribed.pdf
since, for me, they are the clearest explanation of the topic I could find.
While I can follow how the function can be approximated at point y as:
$$f(y) ≈ f(x) + ∇f(x)^T (y − x) + \frac{1}{2}(y − x)^T ∇^2f(θ(x − y) + y)(y − x)$$
which, by replacing $∇^2f(θ(x − y) + y)$ with $\frac{1}{t}I$, can be rewritten as:
$$f(y) ≈ f(x) + ∇f(x)^T(y − x) + \frac{1}{2t}||y − x||_2^2 = g(y)$$
I have trouble following the first iff statement given here:
$$∇g(y) = 0 ⇔ ∇f(x)+ \frac{1}{t}(y −x) = 0 ⇔ y = x−t∇f(x)$$
This, if I understand correctly, derives the GD update rule by setting the gradient of $g(y)$ to $0$, which gives $∇f(x)+ \frac{1}{t}(y −x) = 0$.
My guess is that the author does not provide a step-by-step explanation of this passage because it is considered trivial; others I have consulted give the same reasoning in a slightly different form (see, for example, slide 19 here: https://www.cs.huji.ac.il/~shais/Lectures2014/lecture6.pdf, from Shai Shalev-Shwartz's lecture notes, adapted from page 185 of the 2014 book "Understanding Machine Learning: From Theory to Algorithms" by the same author with Shai Ben-David).
But I would nevertheless really appreciate it if someone could break this specific passage down mathematically and point to the rules behind it.
Start from the definition of $g$:
$$g(y)=f(x) + \nabla f(x)^T(y-x) + \frac1{2t}\|y-x\|_2^2$$
Differentiating with respect to $y$,
$$\nabla g(y) = \nabla f(x) + \frac1{t}(y-x)$$
Hence $$\nabla g(y)=0 \iff \nabla f(x) + \frac1t (y-x)=0$$
which is equivalent to $$\frac1t(y-x) = -\nabla f(x),$$
i.e.
$$y=x-t\nabla f(x)$$
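As a quick numerical sanity check of this derivation, the closed-form minimizer $y = x - t\nabla f(x)$ can be compared against $\nabla g$ directly. The quadratic choice of $f$, the step size $t$, and the random seed below are arbitrary illustrative values, not anything from the notes:

```python
import numpy as np

# Arbitrary concrete example: f(z) = z^T A z / 2 + b^T z with A symmetric
# positive definite, and an arbitrary step size t.
rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
A = A @ A.T + n * np.eye(n)          # symmetric positive definite
b = rng.normal(size=n)
t = 0.1

f = lambda z: 0.5 * z @ A @ z + b @ z
grad_f = lambda z: A @ z + b         # ∇f for this quadratic (A symmetric)

x = rng.normal(size=n)

# g(y) = f(x) + ∇f(x)^T (y - x) + ||y - x||^2 / (2t)
g = lambda y: f(x) + grad_f(x) @ (y - x) + (y - x) @ (y - x) / (2 * t)

# Candidate minimizer from the derivation: y* = x - t ∇f(x)
y_star = x - t * grad_f(x)

# ∇g(y*) = ∇f(x) + (y* - x)/t should vanish identically
grad_g_at_y_star = grad_f(x) + (y_star - x) / t
assert np.allclose(grad_g_at_y_star, 0)

# Since g is strictly convex in y, y* should beat nearby points
for _ in range(100):
    y = y_star + 0.1 * rng.normal(size=n)
    assert g(y_star) <= g(y)
```

The two assertions confirm both directions of the iff: the gradient of $g$ vanishes exactly at $x - t\nabla f(x)$, and that point is indeed the minimizer of the approximation $g$.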
Note that we have used
$$\nabla_y (a^Ty)=a$$
since $$\frac{\partial }{\partial y_i}\sum_{j=1}^n a_jy_j = a_i$$
and
$$\nabla _y \|y\|^2 = \nabla_y (y^Ty)=2y$$
since $$\frac{\partial }{\partial y_i}\sum_{j=1}^n y_j^2 = 2y_i$$
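Both gradient rules can also be verified numerically with central finite differences; the vectors and helper function below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
a = rng.normal(size=n)
y0 = rng.normal(size=n)
eps = 1e-6

def num_grad(fn, y):
    """Central-difference approximation of the gradient of fn at y."""
    g = np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = eps
        g[i] = (fn(y + e) - fn(y - e)) / (2 * eps)
    return g

# ∇_y (a^T y) = a
assert np.allclose(num_grad(lambda y: a @ y, y0), a, atol=1e-5)

# ∇_y ||y||^2 = ∇_y (y^T y) = 2y
assert np.allclose(num_grad(lambda y: y @ y, y0), 2 * y0, atol=1e-5)
```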