I am studying Gradient Descent and Stochastic Gradient Descent, and the text says that one of the advantages of SGD over GD is that GD can be computationally expensive for large datasets. In particular, it says that in GD the gradient is calculated over all of the data points, while in SGD it is calculated over a random batch.
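To make sure I understand the setup, here is a minimal sketch (my own toy example with made-up linear-regression data, not from the text) of what I believe it describes: the full-batch gradient sums over every data point, while the stochastic/mini-batch gradient sums over a random subset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1000 data points, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(3)                                # current parameter vector

# GD: gradient of the MSE loss is averaged over ALL 1000 data points
grad_full = 2 * X.T @ (X @ w - y) / len(X)

# SGD (mini-batch): gradient is averaged over a random batch of 32 points
idx = rng.choice(len(X), size=32, replace=False)
grad_batch = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

print(grad_full)
print(grad_batch)  # a noisier estimate of the same gradient
```

Is this the distinction the text is making?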
My question is: why do we need to calculate the gradient at any point other than the one we are currently at?
Let's take a quadratic function as an example:
If we are currently at x1 and we want to move towards the minimum, why would we even care about the derivative at x2? It's not as if we are going to say "oh, it's steeper at x2, let's go there", and nothing like that is implemented in code anyway.
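Here is what I mean, as a toy sketch (my own, not from the text): gradient descent on f(x) = x², where each update only ever evaluates the derivative f'(x) = 2x at the current x.

```python
def grad_descent_quadratic(x0=5.0, lr=0.1, steps=50):
    """Minimize f(x) = x^2 by gradient descent."""
    x = x0
    for _ in range(steps):
        # Only the derivative at the CURRENT point is ever computed
        x -= lr * 2 * x
    return x

print(grad_descent_quadratic())  # converges towards the minimum at x = 0
```

As far as I can tell, no other point on the x-axis is ever touched, which is exactly why I don't see where "all the data points" comes in.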
Also, SGD still calculates the gradient over a PART of the data points, and I still don't understand why we need any point other than the current one. Why not calculate one point at a time as we step on it, instead of calculating all of them?
Thank you, and please give feedback if I can make my question clearer.