Gradient descent against fixed orthonormal set


Given $c$ orthonormal vectors $\mathbf y_1, \mathbf y_2, \dots, \mathbf y_c \in \mathbb R^n$, I have the following cost function to minimize.

$$ J ({\bf x}) := -\frac{1}{1 + \frac1c \sum\limits_{i=1}^c \exp \left({\bf y}_i^\top {\bf x} \right)} $$

The gradient of this function is a linear combination of $\mathbf y_i$: $$\nabla_\mathbf x J=\frac{J^2}{c}\sum_i{\exp(\mathbf y_i^T\mathbf x)\mathbf y_i}$$
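As a sanity check, the closed form of the gradient can be verified against central finite differences (a small NumPy sketch; the dimensions, seed, and step $\varepsilon$ are arbitrary choices):

```python
import numpy as np

def cost(x, Y):
    # J(x) = -1 / (1 + (1/c) * sum_i exp(y_i^T x)); rows of Y are the y_i
    return -1.0 / (1.0 + np.mean(np.exp(Y @ x)))

def grad(x, Y):
    # closed form: (J^2 / c) * sum_i exp(y_i^T x) * y_i
    J = cost(x, Y)
    w = np.exp(Y @ x)                  # shape (c,)
    return (J**2) * (w @ Y) / len(Y)   # linear combination of the y_i

rng = np.random.default_rng(0)
c, n = 3, 8
Q, _ = np.linalg.qr(rng.standard_normal((n, c)))  # orthonormal columns
Y = Q.T                                           # rows are the y_i
x = rng.standard_normal(n)

# finite-difference check of the closed-form gradient
eps = 1e-6
fd = np.array([(cost(x + eps * e, Y) - cost(x - eps * e, Y)) / (2 * eps)
               for e in np.eye(n)])
print(np.max(np.abs(fd - grad(x, Y))))  # tiny: the closed form matches
```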

Therefore every gradient step moves $\mathbf x$ only within the subspace spanned by the $\mathbf y_i$. Intuitively, as $\mathbf x$ is updated with gradient descent, $\mathbf y_i^T\mathbf x$ should decrease for each $\mathbf y_i$, so in the limit $\mathbf x$ should point directly away from $\sum_i \mathbf y_i$ (whose norm is $\sqrt c$ by orthonormality): $$\lim_{\mathbf y_i^T\mathbf x\to-\infty} \frac{\sum_i\mathbf y_i^T}{\sqrt c}\cdot\frac{\mathbf x}{\|\mathbf x\|} = -1$$

Is my intuition correct? If so, how can I proceed to prove the alignment formally?
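The conjecture can at least be probed numerically: run plain gradient descent from a random start and watch the cosine between $\mathbf x$ and $\sum_i \mathbf y_i$ (step size, seed, and iteration count below are arbitrary choices, not prescribed by the problem):

```python
import numpy as np

rng = np.random.default_rng(1)
c, n = 4, 10
Q, _ = np.linalg.qr(rng.standard_normal((n, c)))
Y = Q.T                               # rows y_i are orthonormal
s = Y.sum(axis=0)                     # sum_i y_i, with ||s|| = sqrt(c)

x = rng.standard_normal(n)            # random initialization
eta = 1.0                             # step size (an arbitrary choice)
for _ in range(20000):
    J = -1.0 / (1.0 + np.mean(np.exp(Y @ x)))
    x = x - eta * (J**2) * (np.exp(Y @ x) @ Y) / c

cos_theta = (s @ x) / (np.linalg.norm(x) * np.sqrt(c))
print(cos_theta)  # drifts toward -1 (slowly, since the gradient decays)
```

The drift is logarithmic in $t$ because the gradient magnitude shrinks like $\exp(\mathbf y_i^T\mathbf x)$, so the cosine gets close to $-1$ but never reaches it in finitely many steps.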

Update:

If $\mathbf x$ is initialized to $\mathbf 0$, then the proof is straightforward, since the inner products stay equal across $i$: $\mathbf y_i^T\mathbf x = k$ for every $i$ at each step (with the common value $k$ changing from step to step).

$$\nabla_\mathbf x J=s\sum_i{\mathbf y_i}$$

where $$s=\frac{J^2}{c}\exp(k)$$

The GD updates then simply accumulate scaled copies of $\sum_i{\mathbf y_i}$ over $t$ steps: $$\mathbf x^t = -\eta\sum_{\tau<t} s_\tau \sum_i \mathbf y_i$$ Thus the direction of $\mathbf x^t$ is always opposite to that of $\sum_i \mathbf y_i$, which proves the claim for this initialization.
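The zero-initialization argument is easy to check directly: the iterates stay exactly antiparallel to $\sum_i \mathbf y_i$, up to floating-point error (a small NumPy sketch; step size and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
c, n = 3, 6
Q, _ = np.linalg.qr(rng.standard_normal((n, c)))
Y = Q.T                               # rows y_i are orthonormal
s = Y.sum(axis=0)                     # sum_i y_i

x = np.zeros(n)                       # zero initialization
eta = 0.5                             # step size (arbitrary)
for _ in range(100):
    J = -1.0 / (1.0 + np.mean(np.exp(Y @ x)))
    # every y_i^T x is the same value k here, so the update is a
    # common scalar times sum_i y_i
    x = x - eta * (J**2) * (np.exp(Y @ x) @ Y) / c

cos_theta = (s @ x) / (np.linalg.norm(x) * np.sqrt(c))
print(cos_theta)  # -1 up to floating-point error, at every step
```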

On the other hand, if $\mathbf x$ is randomly initialized, I am not able to proceed.

Best answer:

The alignment with the negative direction of $\sum \mathbf y_i$ can be proven via the GD update rule: $$\mathbf x^{t+1} = \mathbf x^t - \eta\nabla_\mathbf xJ$$

Taking the inner product with $\sum_i \mathbf y_i$ and using the distributive property together with the orthonormality of the $\mathbf y_i$, we have $$\left(\sum_i \mathbf y_i\right)^T\mathbf x^{t+1} = \left(\sum_i \mathbf y_i\right)^T\mathbf x^t - \eta\left(\sum_i \mathbf y_i\right)^T\nabla_\mathbf xJ$$ $$=\left(\sum_i \mathbf y_i\right)^T\mathbf x^t - \eta\frac{J^2}{c}\sum_i{\exp(\mathbf y_i^T\mathbf x^t)} $$ Dividing by the norms of $\sum_i \mathbf y_i$ and $\mathbf x^{t+1}$, which are $\sqrt c$ and $\|\mathbf x^{t+1}\|$ respectively, we have $$\frac{(\sum \mathbf y_i)^T\mathbf x^{t+1}}{\|\mathbf x^{t+1}\|\sqrt c} = \frac{(\sum \mathbf y_i)^T\mathbf x^t}{\|\mathbf x^{t+1}\|\sqrt c} - \eta\frac{J^2}{\|\mathbf x^{t+1}\|c\sqrt c}\sum_i{\exp(\mathbf y_i^T\mathbf x^t)}$$

Now we multiply and divide the right-hand side by $\|\mathbf x^t\|$: $$\frac{(\sum \mathbf y_i)^T\mathbf x^{t+1}}{\|\mathbf x^{t+1}\|\sqrt c} = \frac{\|\mathbf x^t\|}{\|\mathbf x^{t+1}\|}\frac{(\sum \mathbf y_i)^T\mathbf x^t}{\|\mathbf x^t\|\sqrt c} - \frac{\|\mathbf x^t\|}{\|\mathbf x^{t+1}\|}\frac{\eta J^2}{\|\mathbf x^t\|c\sqrt c}\sum_i{\exp(\mathbf y_i^T\mathbf x^t)}$$ $$= \frac{\|\mathbf x^t\|}{\|\mathbf x^{t+1}\|}\left[\frac{(\sum \mathbf y_i)^T\mathbf x^t}{\|\mathbf x^t\|\sqrt c} - \frac{\eta J^2}{\|\mathbf x^t\|c\sqrt c}\sum_i{\exp(\mathbf y_i^T\mathbf x^t)}\right]$$

Note that the term on the left is the cosine of the angle $\theta^{t+1}$ between $\mathbf x^{t+1}$ and $\sum_i \mathbf y_i$, and the first term inside the brackets is $\cos \theta^t$: $$\cos \theta^{t+1} = \frac{\|\mathbf x^t\|}{\|\mathbf x^{t+1}\|}\left[\cos\theta^t - \frac{\eta J^2}{\|\mathbf x^t\|c\sqrt c}\sum_i{\exp(\mathbf y_i^T\mathbf x^t)}\right]$$ The subtracted term is always positive, and once the updates push $\mathbf x$ away from the $\mathbf y_i$ the norm $\|\mathbf x\|$ grows, so $\frac{\|\mathbf x^t\|}{\|\mathbf x^{t+1}\|} < 1$. Whenever $\cos\theta^t \ge 0$, these two facts immediately give $$\cos \theta^{t+1}<\cos\theta^t$$ (for $\cos\theta^t < 0$ the shrinkage factor alone pushes the cosine toward $0$, so the strict decrease requires that the subtracted term outweigh this effect). The cosine is thus decreasing and bounded below by $-1$, and the subtracted term keeps the decrease going while $\cos\theta^t > -1$, so $$\lim_{\mathbf y_i^T\mathbf x\to-\infty} \frac{\sum_i\mathbf y_i^T}{\sqrt c}\cdot\frac{\mathbf x}{\|\mathbf x\|} = \lim_{t\to\infty}\cos\theta^t=-1 $$
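The recursion above is an exact algebraic identity, which can be verified numerically by computing $\cos\theta^{t+1}$ both directly and via the derived formula along a GD trajectory (seed, dimensions, and step size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
c, n = 4, 10
Q, _ = np.linalg.qr(rng.standard_normal((n, c)))
Y = Q.T                               # rows y_i are orthonormal
s = Y.sum(axis=0)

x = rng.standard_normal(n)
eta = 0.5
for _ in range(50):
    J = -1.0 / (1.0 + np.mean(np.exp(Y @ x)))
    w = np.exp(Y @ x)
    x_new = x - eta * (J**2) * (w @ Y) / c

    # cos(theta^{t+1}) computed directly ...
    lhs = (s @ x_new) / (np.linalg.norm(x_new) * np.sqrt(c))
    # ... and via the recursion derived above
    cos_t = (s @ x) / (np.linalg.norm(x) * np.sqrt(c))
    rhs = (np.linalg.norm(x) / np.linalg.norm(x_new)) * (
        cos_t - eta * J**2 * w.sum() / (np.linalg.norm(x) * c * np.sqrt(c)))
    assert abs(lhs - rhs) < 1e-10     # the two agree at every step
    x = x_new
```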


Note that it can be concluded from the above argument that the cosine decreases faster when $\|\mathbf x\|$ is small. If $\|\mathbf x\|$ is instead fixed to some radius $r$, we have projected GD (PGD) on the $r$-hypersphere. In that case, PGD will oscillate around the optimal alignment, with the intensity of the oscillation determined by the choice of $\eta$ and $r$.