Question About Gradient Descent


Gradient descent is a numerical optimization method for finding a local or global minimum of a function. It is given by the following formula: $$ x_{n+1} = x_n - \alpha \nabla f(x_n) $$

For the sake of simplicity, let us take a function of one variable, $f(x)$. In that case, the gradient becomes the derivative $\frac{df}{dx}$, and the formula for gradient descent becomes: $$ x_{n+1} = x_n - \alpha \frac{df}{dx}(x_n) $$

My question is: how can we get the new iterate $x_{n+1}$ from the change in the value of $f$? The gradient defines both the direction and magnitude of the greatest increase of $f$ at a given point, not how much $x$ changes, so it makes no sense to me that we use it in a formula to compute new values of $x$.


There are 4 answers below.


The intuitive notion behind this choice is that a graph is usually steeper farther away from an extreme and flatter closer to an extreme.

So if the gradient is large (as in far away from $0$), then we are presumably far away from an extreme, and we can take a large step without worrying too much about stepping past the extreme we're looking for.

If the gradient is small (as in close to $0$), then presumably we are getting close to a local extreme, so we reduce our step size accordingly so that we don't stray too far away.

Using the gradient directly, rather than trying to make a more qualified guess as to how far away the extreme is, makes this algorithm easy to compute, while still giving decent results. A lot rides on the choice of $\alpha$, though, which technically doesn't have to be constant.
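As a quick illustration of that shrinking-step behaviour, here is a minimal sketch in plain Python. The function $f(x) = x^2$ (so $f'(x) = 2x$) and the function name `descend` are assumptions for the example:

```python
def descend(x, alpha=0.1, steps=5):
    """Run a few gradient-descent steps on f(x) = x**2 (so f'(x) = 2x)."""
    history = [x]
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2 at the current point
        x = x - alpha * grad  # the step taken is alpha * |grad|
        history.append(x)
    return history

print(descend(10.0))  # iterates approach 0; each step is smaller than the last
```

Each step is $\alpha$ times the gradient, so the steps shrink on their own as the iterates approach the minimum and the graph flattens.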


Suppose we want to find where $f(x,y,z)=0$.

Then we want $df=\nabla f \cdot d\vec{s}=0$.

In terms of a finite difference, $f(\vec{u}+\Delta\vec{u})-f(\vec{u})\approx\frac{\partial f}{\partial x}\Delta x+\frac{\partial f}{\partial y}\Delta y+\frac{\partial f}{\partial z}\Delta z$.

If we want $f(\vec{u}+\Delta\vec{u})=0$, then for a single variable this becomes $\Delta x=x_{n+1}-x_n=\frac{-f}{df/dx}$.

For ascent/descent, we know the gradient is the direction of steepest ascent. We can take a controlled step in that direction. In that case, the gradient and our step are parallel.

If we want an approximate change of $\Delta r$ in $f$, note that a step $\Delta\vec{s}=\Delta v\,\frac{\nabla f}{|\nabla f|}$ along the unit gradient vector changes $f$ by $\Delta f\approx\nabla f\cdot\Delta\vec{s}=\Delta v\,|\nabla f|$ to first order. So to enforce $\Delta f=\Delta r$, choose $\Delta v=\Delta r/|\nabla f|$. Sometimes you want to change your step size as a function of $|\nabla f|$; changing $\Delta r$ helps you do that.
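That scaling can be checked numerically. This is a hedged sketch assuming the example function $f(x,y)=x^2+y^2$: a step of length $\Delta r/|\nabla f|$ along the gradient direction changes $f$ by approximately $\Delta r$.

```python
import numpy as np

def f(p):
    return p[0]**2 + p[1]**2

def grad_f(p):
    return np.array([2 * p[0], 2 * p[1]])

p = np.array([3.0, 4.0])
g = grad_f(p)                  # gradient at p: (6, 8), so |g| = 10
delta_r = 0.1                  # desired first-order change in f
# step of length delta_r / |g| in the unit-gradient direction
step = (delta_r / np.linalg.norm(g)) * (g / np.linalg.norm(g))
print(f(p + step) - f(p))      # approximately delta_r
```

The discrepancy from exactly $\Delta r$ is the second-order term, which shrinks with the step size.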


Before we go further, let's take a look at a part of your question.

The gradient defines both the direction and magnitude of the greatest increase of $f$ at a given point, not how much $x$ changes, so it makes no sense to me that we use it in a formula to compute new values of $\mathbf{x}$.

You are right that the gradient vector at some point, $\nabla f(\mathbf{x})$, gives us the magnitude and direction of greatest increase of $f$ at the point $\mathbf{x}$.

By the way this function is structured, the value of $f$ changes based on the input $\mathbf{x}$. We are optimizing by finding the right value of $\mathbf{x}$ such that $f$ is at its global minimum.

We know that $\nabla f(\mathbf{x})$ points in the direction of greatest increase, so naturally we go in the opposite direction to travel in the direction of the greatest decrease. The $\mathbf{x}$ values in this direction, when inputted into $f$, output smaller values because the function is decreasing.

We are now changing the $\mathbf{x}$ values so that they move toward this decrease.

To better understand this, let's take apart the update rule. As you said:

$$\mathbf{x}_{i + 1} = \mathbf{x_i} - \alpha\nabla f(\mathbf{x_i})$$

The $\alpha$ parameter is a learning rate that defines how large our step in the direction of the negative gradient is.

$\nabla f(\mathbf{x}_i)$ gives us the rate of change of $f$ at the current point, $\mathbf{x}_i$. By subtracting it, we move against the direction of greatest increase in $f$ as a function of $\mathbf{x}$.

In other words, the algorithm updates $\mathbf{x}_i$ such that we travel toward the $\mathbf{x}_{i + n}$ values that give the lowest values of $f$.
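To make the update rule concrete, here is one step worked through numerically (an assumed example: $f(x) = x^2$, so $f'(x) = 2x$, with $\alpha = 0.1$):

```python
alpha = 0.1                # learning rate
x = 3.0                    # current iterate, where f(x) = 9.0
grad = 2 * x               # f'(x) = 2x, so grad = 6.0
x_new = x - alpha * grad   # 3.0 - 0.1 * 6.0, approximately 2.4
# x moved toward the minimum at 0, and f decreased from 9.0 to about 5.76
print(x_new, x_new**2)
```

One update moves $x$ partway toward the minimizer; repeating the step continues the descent.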


We can visualize this with a basic example of gradient descent.

We will attempt to optimize a variable of one function, $f(x) = x^2$.

I created a small program for this, and you can run it yourself to mess around. (To save time, I did not implement autodiff or similar methods, so the derivative is passed in explicitly as df.)

import numpy as np
import matplotlib.pyplot as plt

def gd(f, df, learning_rate=0.1, left_bound=-10, right_bound=10, interval=1, num_iterations=100, x=-10):
  # plot the function itself over [left_bound, right_bound)
  points = np.arange(left_bound, right_bound, interval)
  y = f(points)
  fig, axs = plt.subplots()
  axs.plot(points, y)
  for i in range(num_iterations):
    x = x - learning_rate * df(x)  # gradient descent update
    if i + 1 == num_iterations:
      axs.plot(x, f(x), 'ro')  # mark the final iterate in red
    else:
      axs.plot(x, f(x), 'bo')
  plt.show()

If we look at the output, it looks like:

[Figure: Gradient Descent Visualized]

with:

gd(lambda x: x**2, lambda x: 2 * x)

You can see that the algorithm slowly approaches the minimum of $f$ for a small enough value of $\alpha$.


Thanks everyone for the answers, since they helped me answer my question. The most basic equation I've written for gradient descent: $$ x_{n+1} = x_n - \alpha \nabla f(x_n)$$ needs to be seen in vector form, where the points $x_n$ and $x_{n+1}$ can be defined as position vectors. I had problems understanding this equation because I didn't look at it in terms of vectors.

Since the gradient is a vector-valued function that gives the direction of steepest ascent at the point $x_n$, subtracting $\alpha \nabla f(x_n)$ from the position vector $x_n$ makes a lot of sense: we move in the direction of greatest descent and arrive at the next point, given by the position vector $x_{n+1}$.

The parameter $\alpha$ basically defines how far in the opposite direction of the gradient vector we want to go (opposite because the gradient defines the direction of steepest ascent and we want to descend). If the parameter has the value 0.5, for example, it means we move in the opposite direction of the gradient vector by a length equal to 0.5 times the magnitude of the gradient at the point $x_n$.

Its value can be changed during optimization. If it is too big we can overshoot the minimum, and if it is too small it can take too many iterations to converge.

I would say that if the magnitude of the gradient is big, the step size can be bigger, and if the magnitude is small, that means we are close to a minimum, so we need to make the step size smaller in order not to miss it.
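That trade-off is easy to demonstrate. A hedged sketch on the assumed example $f(x) = x^2$, where the update is $x \leftarrow x - \alpha \cdot 2x$:

```python
def run(alpha, x=1.0, steps=20):
    """Iterate gradient descent on f(x) = x**2 and return the final x."""
    for _ in range(steps):
        x = x - alpha * 2 * x  # update: x <- (1 - 2*alpha) * x
    return x

print(run(0.4))   # converges quickly toward 0
print(run(0.01))  # converges, but slowly
print(run(1.1))   # diverges: each step overshoots and the iterates grow
```

For $f(x) = x^2$ the iterates satisfy $x_{n+1} = (1 - 2\alpha)x_n$, so gradient descent converges exactly when $|1 - 2\alpha| < 1$, i.e. $0 < \alpha < 1$.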