I was reading a paper by the great mathematician Sourav Chatterjee.
I got particularly stuck on this section, together with Theorem 2.1.
Question 1: What does $\alpha(x_0,r)$ mean, and why is it needed?
Can anyone help me understand this section clearly?

I haven't read the paper, so this answer is more general: it attempts to explain why such assumptions are needed at all. The goal of papers like this is to prove that an algorithm (gradient descent) converges to a minimizer of some loss function at some rate. In classical optimization, one might assume that the loss function is convex and differentiable, in which case there are theoretical guarantees on the performance of gradient descent; see, for example:
https://www.stat.cmu.edu/~ryantibs/convexopt-F13/scribes/lec6.pdf
If you read through these proofs, it becomes clear why assumptions such as convexity, differentiability, and Lipschitz continuity are important for the theorems to hold. One cannot hope that gradient descent will succeed when applied to an arbitrary function.
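To make the classical setting concrete, here is a minimal sketch of gradient descent on a smooth, strongly convex quadratic, where the standard theory guarantees convergence with step size $1/L$ ($L$ being the smoothness constant, here the largest eigenvalue). The function names and the specific quadratic are illustrative choices of mine, not anything from the papers above.

```python
import numpy as np

def gradient_descent(grad, x0, step, n_iters):
    """Plain gradient descent: x_{k+1} = x_k - step * grad(x_k)."""
    x = x0
    for _ in range(n_iters):
        x = x - step * grad(x)
    return x

# f(x) = 0.5 * x^T A x - b^T x, with A positive definite, so f is
# strongly convex and its gradient is Lipschitz with constant L = 3
# (the largest eigenvalue of A).
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

# Step size 1/L is the classical safe choice for a smooth convex function.
x_star = gradient_descent(grad, np.zeros(2), step=1.0 / 3.0, n_iters=200)
# x_star approaches the unique minimizer A^{-1} b = [1/3, 1].
```

If you drop convexity (say, a loss with many saddle points and local minima, as in deep learning), this guarantee disappears, which is exactly why the papers below look for replacement assumptions.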
In the case of deep learning, one could make the same assumptions on the loss and reuse the same types of results. The problem with this approach is that in deep learning the classical assumptions are severely violated, to the point that the classical results cannot explain the performance seen in practice. Deep learning involves highly non-linear and non-convex loss landscapes, so there is a need for novel assumptions that are more realistic in these settings. One such assumption (PL*) is outlined in some detail here:
https://arxiv.org/pdf/2003.00307.pdf
and, as mentioned in that paper, it is related to the condition you are interested in. This assumption ensures that a solution exists and that gradient descent achieves a certain convergence rate. What's more, it is more realistic in the deep learning setting than the classical assumptions.
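For intuition, the plain Polyak–Łojasiewicz (PL) inequality, of which PL* is the variant with optimal loss equal to zero, says the gradient norm dominates the suboptimality (here $\mu > 0$ and $w$ are the standard symbols from the PL literature, not necessarily the notation of either paper):

$$\frac{1}{2}\,\|\nabla \mathcal{L}(w)\|^2 \;\ge\; \mu\,\bigl(\mathcal{L}(w) - \mathcal{L}^*\bigr) \quad \text{for all } w.$$

Under an additional $L$-smoothness assumption, gradient descent with step size $1/L$ then converges linearly:

$$\mathcal{L}(w_k) - \mathcal{L}^* \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(\mathcal{L}(w_0) - \mathcal{L}^*\bigr).$$

Note that this requires neither convexity nor a unique minimizer, which is what makes it plausible for deep learning losses.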
The paper you refer to uses a slightly different assumption. To understand exactly why it is needed, you have to read through the proofs, but the general idea is what I have described here.