Let $f(x)=x^T A x$, where $A$ is a positive definite matrix ($A \succ 0$).
I need to show that there exists a step size $\eta$ such that $T = O(\log(1/\varepsilon))$ iterations of gradient descent are sufficient to reach an $\varepsilon$-optimal solution (i.e. $f(w_T)\leq\varepsilon$).
I have proven the following:
- $\nabla f(x)=2Ax$, so the gradient descent step is $w_{t+1}=w_t-\eta\cdot\nabla f(w_t)=w_t-2\eta Aw_t=(I-2\eta A)w_t$
- Let $B=(I-2\eta A)$
- Let $\lambda > 0$ be an eigenvalue of $A$ with corresponding eigenvector $v\neq0$. Then $Bv=(I-2\eta A)v=v-2\eta Av=v-2\eta \lambda v=(1-2\eta\lambda)v$. Hence, $1-2\eta\lambda$ is an eigenvalue of $B$ with eigenvector $v$.
- Choose $\eta$ such that $0<1-2\eta\lambda<1$, or equivalently $0<\eta < \frac{1}{2\lambda}$
- Notice that if $v$ is an eigenvector of $A$ with eigenvalue $\lambda$, then $f(v)=v^TAv=v^T(\lambda v) =\lambda \lVert v \rVert^2$
- $f(x)=x^TAx$ is convex and $f(0)=0$, so for any $0\leq c\leq 1$ we have $f(cx)=f(cx+(1-c)\cdot 0)\leq c f(x)+(1-c)f(0)=c f(x)$ (writing $c$ for the scaling factor to avoid clashing with the eigenvalue $\lambda$).
- Now, set the starting point $w_1$ of gradient descent to be an eigenvector of $A$ with corresponding eigenvalue $\lambda$.
- As shown above, if $w_1$ is an eigenvector of $A$ with eigenvalue $\lambda$, then $1-2\eta\lambda$ is an eigenvalue of $B$ with the same eigenvector, so inductively $w_{t+1}=Bw_t=B^t w_1=(1-2\eta\lambda)^t w_1$.
- Now, using all this information (the convexity step above with $c=(1-2\eta\lambda)^t\in(0,1)$), we get
$\begin{align*} f(w_{t+1})=f(Bw_t)=f((1-2\eta\lambda)^t w_1) \leq (1-2\eta\lambda)^t \cdot f(w_1)=(1-2\eta\lambda)^t \lambda \lVert w_1 \rVert^2 \end{align*}$
We want $f(w_{t+1}) \leq \varepsilon$, so $(1-2\eta\lambda)^t \lambda \lVert w_1 \rVert^2 \leq \varepsilon$
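As a quick numerical sanity check of the derivation so far, here is a minimal sketch (assuming `numpy`; the matrix $A$, the step size, and the horizon are my own arbitrary illustrative choices, not part of the problem) that verifies the closed form for the iterates and the bound above:

```python
import numpy as np

# Sanity check: starting from an eigenvector of A, gradient descent
# satisfies w_{t+1} = (1 - 2*eta*lambda)^t * w_1 and the bound
# f(w_{t+1}) <= (1 - 2*eta*lambda)^t * f(w_1).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # an arbitrary symmetric PD matrix
eigvals, eigvecs = np.linalg.eigh(A)
lam, w1 = eigvals[-1], eigvecs[:, -1]   # largest eigenvalue and its eigenvector,
                                        # so this eta contracts every eigendirection
eta = 0.9 / (2 * lam)                   # any eta with 0 < eta < 1/(2*lambda)

f = lambda x: x @ A @ x
w = w1.copy()
for t in range(1, 11):
    w = w - eta * 2 * A @ w             # w_{t+1} = (I - 2*eta*A) w_t
    assert np.allclose(w, (1 - 2 * eta * lam) ** t * w1)    # closed form
    assert f(w) <= (1 - 2 * eta * lam) ** t * f(w1) + 1e-12  # the bound
print("closed form and bound verified for t = 1..10")
```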
Now after solving for $t$, I got
$\begin{align*} t \geq \frac{1}{\ln{(1-2\eta \lambda)}}\cdot \left(-\ln{\left(\frac{\lambda \lVert w_1 \rVert^2}{\varepsilon}\right)}\right) \end{align*}$
The problem is that $0< 1-2\eta \lambda < 1$, so $\ln{(1-2\eta \lambda)}<0$, and this seems to break the inequality.
If anybody could help me find the mistake or solve this problem, I would appreciate it.
Note: I need to show this without using the Lipschitz property.
Thank you in advance.
This is solved. It occurred to me that
$$ t \geq \frac{-\ln{\left(\frac{\lambda \lVert w_1 \rVert^2}{\varepsilon}\right)}}{\ln{(1-2\eta \lambda)}}$$
On the RHS we have a positive number: $0 <1-2\eta \lambda < 1$ gives $\ln{(1-2\eta \lambda)}<0$, and as long as $\lambda \lVert w_1 \rVert^2 > \varepsilon$ the numerator $-\ln{(\lambda \lVert w_1 \rVert^2/\varepsilon)}$ is negative too, so the quotient is positive.
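Spelled out, the algebra behind this (including the inequality flip when dividing by the negative number $\ln(1-2\eta\lambda)$) is:

$$\begin{align*}
(1-2\eta\lambda)^t \,\lambda \lVert w_1 \rVert^2 \leq \varepsilon
&\iff t\ln(1-2\eta\lambda) \leq \ln\left(\frac{\varepsilon}{\lambda \lVert w_1 \rVert^2}\right)\\
&\iff t \geq \frac{\ln\left(\frac{\varepsilon}{\lambda \lVert w_1 \rVert^2}\right)}{\ln(1-2\eta\lambda)}
= \frac{-\ln\left(\frac{\lambda \lVert w_1 \rVert^2}{\varepsilon}\right)}{\ln(1-2\eta\lambda)}.
\end{align*}$$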
Since the RHS equals $\frac{1}{-\ln(1-2\eta\lambda)}\cdot\ln{\left(\frac{\lambda \lVert w_1 \rVert^2}{\varepsilon}\right)}$, which grows like $\ln(1/\varepsilon)$, it is enough to run $T=O(\log(1/\varepsilon))$ steps to reach $\varepsilon$-accuracy (and of course any additional steps only improve the bound).
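To see the bound in action, here is a second minimal sketch (same assumed setup as the check above; $\varepsilon$ is another arbitrary choice of mine) that computes $T$ from the formula and confirms gradient descent reaches $\varepsilon$-accuracy by step $T$:

```python
import numpy as np

# Compute T from the bound above and confirm f(w_T) <= eps.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # same arbitrary PD matrix as before
eigvals, eigvecs = np.linalg.eigh(A)
lam, w1 = eigvals[-1], eigvecs[:, -1]
eta = 0.9 / (2 * lam)
eps = 1e-8                              # arbitrary target accuracy

rate = 1 - 2 * eta * lam                # contraction factor in (0, 1)
T = int(np.ceil(np.log(lam * (w1 @ w1) / eps) / -np.log(rate)))

w = w1.copy()
for _ in range(T):
    w = w - eta * 2 * A @ w             # one gradient descent step
print(f"T = {T}, f(w_T) = {w @ A @ w:.3e}")
assert w @ A @ w <= eps
```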
Hope it helps someone.