Number of calls to oracle in Randomized Stochastic Gradient (RSG) algorithm

67 Views Asked by At

In Stochastic First-and Zeroth-order Methods for Nonconvex, on page 10, the author claims that the number of calls to Stochastic First order Oracle (SFO) for finding an $(\epsilon, \Lambda)$ solution after disregarding a few constant factors, can be bounded by $$ \mathcal{O}(\frac{1}{\Lambda \epsilon} + \frac{\sigma^2}{\Lambda^2 \epsilon^2})\tag{1} $$ which apparently is a consequence of $(2.19)$. However, I have the following questions:

1- What is the relation between $\lambda$ and $\Lambda$, because form Markov inequity and $(2.14)$ we have $$ \text{Prob}\{ \| f(x_R) \|^2 \geq \lambda L \mathcal{B}_N \} \leq \frac{1}{\lambda}\tag{2.19} $$ So $$ 1 - \text{Prob}\{ \| f(x_R) \|^2 < \lambda L \mathcal{B}_N \} \leq \frac{1}{\lambda} $$ Since $\| f(x_R) \|^2$ is a continuous random variable, we have equality in the $\text{Prob}$ and $$ \text{Prob}\{ \| f(x_R) \|^2 \leq \lambda L \mathcal{B}_N \} \geq 1- \frac{1}{\lambda} $$ Is $\Lambda = \frac{1}{\lambda}$?

2- Is the number of calls the same as expected value of $R$? If so, by choosing constant stepsize pmf of $R$ is uniform and it becomes $N/2$, then how we are getting bound $(1)$?

3- From which formula $(1)$ is derived and what are the steps?

Note: Please answer all the questions I have not partially. Also, please help me to walk through all the steps deriving $(1)$?

1

There are 1 best solutions below

0
On BEST ANSWER

That is a very interesting paper. I will provide a "brief" derivation but the steps in between can be derived rather easily.

1,2,3 - By Markov's bound, you have

$\text { Prob }\left\{\|\nabla f(\bar{x})\|^{2} \leq \epsilon\right\} \geq 1-\Lambda,$

or in other words, by looking at the "complement" of that equation, you get

$\text { Prob }\left\{\|\nabla f(\bar{x})\|^{2} \geq \epsilon\right\} \geq \Lambda.$

Now all that is left to do is to compare the above equation with equation (2.19) in the paper. Therefore, you get

$\epsilon = \lambda L \mathcal{B}_{N}$ and $\Lambda = \frac {1}{\lambda}.$

Once you have these equations you are basically there. Now use the above two equations in combination with (2.14) in your paper to write

$L\left(\frac{LD_{f}^{2}}{N}+\left(\tilde{D}+\frac{D_{f}^{2}}{\tilde{D}}\right)\frac{\sigma}{\sqrt{N}}\right)=\epsilon\Lambda.$

Define constants $C_{1}=D_{f}^{2}$ and $C_{2}:=\tilde{D}+\frac{D_{f}^{2}}{\tilde{D}}$ and rewrite the above equation to get

$C_{1}\frac{L}{N}+C_{2}\frac{\sigma}{\sqrt{N}}=\frac{\epsilon\Lambda}{L}.$

This equation is a quadratic in terms of $N$. Solve the quadratic for $N$ and you will get

$N \sim \mathcal{O}\left\{\frac{1}{\Lambda \epsilon}+\frac{\sigma^{2}}{\Lambda^{2} \epsilon^{2}}\right\},$

which is what you are looking for. Good luck!