I am asking this question from the perspective of linear regression.
To make gradient descent faster we use SGD. But SGD picks a single sample at random, computes the gradient from it, and moves in that direction.
What if the direction given by that sample is wrong? How does SGD still manage to get closer to the global minimum?
Assuming the samples used for SGD are iid, each single-sample gradient is an unbiased estimator of the full gradient over your dataset. Any individual step may point in the wrong direction, but in expectation the steps move you toward the optimum, and the errors average out over many iterations. One benefit of the randomness in SGD is that the noise can sometimes dislodge you from local optima. To reduce the noise, though, people often use "mini-batches", where the gradient is computed over a small batch of iid samples rather than a single one.
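To make this concrete, here is a minimal sketch (synthetic data and all parameter values are my own choices, not from your setup): fitting y = 3x + 2 with single-sample SGD and with mini-batch SGD. Individual steps are noisy, yet both runs land near the true coefficients because the per-sample gradients are unbiased.

```python
import numpy as np

# Synthetic linear-regression data: y = 3x + 2 plus small noise.
rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(-1, 1, size=n)
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=n)

def sgd_linreg(X, y, lr=0.05, epochs=20, batch_size=1, seed=0):
    """Fit w, b by SGD on squared error.
    batch_size=1 is plain single-sample SGD; larger values give mini-batch SGD."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    idx = np.arange(len(X))
    for _ in range(epochs):
        rng.shuffle(idx)  # reshuffle each epoch so batches look iid
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            xb, yb = X[batch], y[batch]
            err = (w * xb + b) - yb          # residuals on this batch
            # Gradient of mean squared error on the batch -- a noisy
            # but unbiased estimate of the full-dataset gradient:
            grad_w = 2.0 * np.mean(err * xb)
            grad_b = 2.0 * np.mean(err)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

w1, b1 = sgd_linreg(X, y, batch_size=1)     # single-sample SGD
w32, b32 = sgd_linreg(X, y, batch_size=32)  # mini-batch SGD (less noisy steps)
print(w1, b1)   # both runs end up close to w=3, b=2
print(w32, b32)
```

The mini-batch run takes smoother steps (the batch average has lower variance), but both converge to roughly the same place; that is the "unbiased in expectation" property in action.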