How is the Learning Rate multiplied by the Gradient/Slope in the Step Size of the Gradient Descent Algorithm?


As I understand it, the gradient of a function at a point gives a vector showing the direction and rate of change of the function at that point. The update formula for the new intercept is: new intercept = previous intercept − step size,
$$x_{\text{new}} = x_{\text{previous}} - \text{StepSize},$$
where the step size is the learning rate multiplied by the gradient:
$$x_{k+1} = x_k - \mathrm{LR}\cdot\nabla f(x_k).$$

**My question is:** for a function $y = f(x)$, the gradient gives the rate of change of the $y$ value with respect to a change in the $x$ value. How, then, can we subtract it from the previous $x$ value (the intercept) to find the new $x$ value? Isn't that like subtracting $\Delta y/\Delta x$ from $x$ to get a new $x$, as in
$$x_{k+1} = x_k - \mathrm{LR}\cdot\frac{\Delta y}{\Delta x}?$$
Here $x$ and $y$ (in this case $x$ stands for the intercept and $y$ for the residual sum of squares / cost function / loss function value) are entirely different quantities on different axes, so I can't see how this subtraction makes sense.


There is 1 solution below.


There's probably some confusion about "step size". The step size is a scalar that scales the descent direction (for gradient descent, the negative gradient), so it plays the same role as your learning rate.

In a minimization problem, the idea of gradient descent is to use gradient information to decrease the objective function. Given the current iterate $x^{(k)}$, you try to find a direction $d^{(k)}$ that will decrease your objective function $f$: $f(x^{(k)} + d^{(k)}) < f(x^{(k)})$.
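As a concrete illustration of that idea (my own sketch, not part of the original answer), here is gradient descent on $f(x) = x^2$, whose derivative is $f'(x) = 2x$; every step moves against the gradient, driving $f$ toward its minimum at $x = 0$:

```python
def grad_descent(x0, lr=0.1, steps=50):
    """Minimize f(x) = x^2 by repeatedly stepping against the gradient."""
    x = x0
    for _ in range(steps):
        grad = 2 * x       # gradient of f at the current iterate x_k
        x = x - lr * grad  # update rule: x_{k+1} = x_k - lr * grad f(x_k)
    return x

x_min = grad_descent(x0=5.0)  # ends up very close to the minimizer x = 0
```

Each iterate satisfies $f(x_{k+1}) < f(x_k)$ here because the learning rate is small enough for the tangent approximation to hold.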

According to calculus, the gradient $\nabla f(x^{(k)})$ of $f$ at the current point $x^{(k)}$ is a direction of increase, so $-\nabla f(x^{(k)})$ should be a direction of decrease. We also know that the tangent (whose normal vector is the gradient) of $f$ at $x^{(k)}$ accurately represents $f$ close to $x^{(k)}$, but the farther we go, the more likely it is that the function and its tangent no longer "agree". So you don't want to go too far from $x^{(k)}$.

That's why you scale your step (a fraction of the direction $d^{(k)}$) with a scalar $\alpha^{(k)} \in (0, 1]$: $f(x^{(k)} + \alpha^{(k)} d^{(k)})$, and try to find the largest $\alpha^{(k)}$ that still decreases $f$.
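One common way to pick such an $\alpha^{(k)}$ in practice is a backtracking line search: start with a large step and shrink it until the function actually decreases. A minimal sketch (my illustration, not from the answer; the names and the simple decrease test are my own simplifications of the usual sufficient-decrease condition):

```python
def backtracking(f, grad_f, x, alpha0=1.0, shrink=0.5, max_tries=30):
    """Shrink the step scalar alpha until f decreases along -grad_f(x)."""
    d = -grad_f(x)                     # descent direction: negative gradient
    alpha = alpha0
    for _ in range(max_tries):
        if f(x + alpha * d) < f(x):    # did this step actually decrease f?
            return alpha
        alpha *= shrink                # too far: tangent no longer "agrees"
    return alpha

f = lambda x: x ** 2
grad_f = lambda x: 2 * x
alpha = backtracking(f, grad_f, x=3.0)  # a full step overshoots, half does not
```

At $x = 3$ the direction is $d = -6$; a full step lands at $-3$ with the same function value, so the search halves $\alpha$ until the step decreases $f$.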

To answer your question: if we have $f: \mathbb{R}^n \rightarrow \mathbb{R}$ and $x \in \mathbb{R}^n$, then the gradient $\nabla f(x^{(k)})$ is a vector with $n$ coordinates (the $n$ partial derivatives of $f$). It has the same size as $x$, so you can safely subtract them. In your one-dimensional case $n = 1$, so the "gradient" is just the derivative $\Delta y/\Delta x$, a single number, and the update is an ordinary subtraction of one number from another: the derivative measures how $y$ changes, but it is itself just a scalar you can scale and subtract from $x$.
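A tiny sketch of why the subtraction is well defined (my illustration, plain Python): for $f(x) = x_1^2 + x_2^2 + x_3^2$ with $x \in \mathbb{R}^3$, the gradient is the vector of partial derivatives $(2x_1, 2x_2, 2x_3)$, which has the same length as $x$, so the update is a componentwise vector subtraction:

```python
x = [1.0, -2.0, 0.5]
grad = [2 * xi for xi in x]  # the n partial derivatives of f, same length as x
lr = 0.1

# x_{k+1} = x_k - lr * grad f(x_k), componentwise: both sides live in R^3
x_new = [xi - lr * gi for xi, gi in zip(x, grad)]
```

The $y$ values (the function values) never appear in the update itself; only the partial derivatives do, and they always match the shape of $x$.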