1. Background of Proximal Gradient Descent
I am studying and using Proximal Gradient Descent (PGD) to solve the following vector optimization problem:
$$ \hat{\mathbf{x}}=\underset{{\mathbf{x}}}{\arg\min}~f(\mathbf{x})+\lambda g(\mathbf{x}), $$
where $\hat{\mathbf{x}}\in\mathbb{R}^N$ is the estimate of the target signal $\bar{\mathbf{x}}$, $f:\mathbb{R}^N\rightarrow[0,+\infty)$ is a smooth convex function (e.g., the $\ell_2$ error), $g:\mathbb{R}^N\rightarrow[0,+\infty)$ is a convex function (the regularizer), and $\lambda\in\mathbb{R}^+$ is the regularization parameter.
From [Lecture Note 1], [Lecture Note 2], and [Lecture Note 3], I know that, for a given initialization $\hat{\mathbf{x}}_0$ and suitable step sizes, PGD converges ($\hat{\mathbf{x}}_k\rightarrow \bar{\mathbf{x}}$), with the following iteration:
$$ \hat{\mathbf{x}}_{k+1}=\text{prox}_{\lambda g}(\hat{\mathbf{x}}_{k}-\rho_k\nabla f(\hat{\mathbf{x}}_{k})), $$
where $\rho_k$ is the step size.
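For concreteness, here is a minimal sketch of this iteration for the LASSO case, assuming $f(\mathbf{x})=\frac{1}{2}\lVert\mathbf{y}-\boldsymbol{\Phi}\mathbf{x}\rVert_2^2$ and $g=\lVert\cdot\rVert_1$ (so the prox is soft-thresholding); `Phi` and `y` are hypothetical names for the sensing matrix and the measurements:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pgd_lasso(Phi, y, lam, rho, K, x0=None):
    """Plain PGD (ISTA) for f(x) = 0.5*||y - Phi x||_2^2, g = ||.||_1."""
    x = np.zeros(Phi.shape[1]) if x0 is None else x0.copy()
    for _ in range(K):
        grad = Phi.T @ (Phi @ x - y)                   # gradient of f
        x = soft_threshold(x - rho * grad, rho * lam)  # prox step
    return x
```

A fixed step size $\rho_k \equiv \rho \le 1/\lVert\boldsymbol{\Phi}\rVert_2^2$ suffices for convergence in this sketch.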
2. Background of My Experiments
(I run some experiments on the Compressed Sensing (CS) task based on the LASSO criterion, which aims to linearly subsample a sparse signal and recover it by a nonlinear, optimization-based procedure.)
In my experiments, I find that the plain PGD method is able to find a good solution close to $\bar{\mathbf{x}}$.
Then, I generalize the original optimization problem to a higher-dimensional space:
$$ \hat{\mathbf{X}}=\underset{{\mathbf{X}}}{\arg\min}~F(\mathbf{X})+\lambda G(\mathbf{X}), $$
and solve it using generalized PGD:
$$ \hat{\mathbf{X}}_{k+1}=\text{prox}_{\lambda G}(\hat{\mathbf{X}}_{k}-\rho_k\nabla F(\hat{\mathbf{X}}_{k})), $$
where the iterations are performed in an $(N+C)$-dimensional space, each $\hat{\mathbf{X}}_k\in\mathbb{R}^{N+C}$ is an intermediate result in the feature space of higher dimensionality $(N+C)$, and the generalized $F(\cdot)$ and $G(\cdot)$ take the same forms as the original $f(\cdot)$ and $g(\cdot)$, except that they accept $(N+C)$-dimensional inputs.
Here I project the initialization into the $(N+C)$-dimensional feature space, $\hat{\mathbf{X}}_0=\mathbf{A}\hat{\mathbf{x}}_\text{init}\in\mathbb{R}^{N+C}$ with $\mathbf{A}\in\mathbb{R}^{(N+C)\times N}$, and obtain the final estimate by a linear recovery operator, $\hat{\mathbf{x}}=\mathbf{B}\hat{\mathbf{X}}_K\in\mathbb{R}^{N}$ with $\mathbf{B}\in\mathbb{R}^{N\times(N+C)}$, where $K$ is the total number of iterations. The matrices satisfy $\mathbf{BA}=\mathbf{I}_N$, i.e., $\mathbf{BAx}=\mathbf{x}$ for all $\mathbf{x}$.
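To make the lifted scheme concrete, here is a minimal sketch for the LASSO case. The post does not specify how $F$ is built from $f$, so I assume $F(\mathbf{X})=\frac{1}{2}\lVert\mathbf{y}-\boldsymbol{\Phi}\mathbf{B}\mathbf{X}\rVert_2^2$ (chosen so that $F(\mathbf{A}\mathbf{x})=f(\mathbf{x})$ whenever $\mathbf{BA}=\mathbf{I}_N$) and $G=\lVert\cdot\rVert_1$; `Phi` and `y` are again hypothetical names:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def generalized_pgd(Phi, y, lam, rho, K, A, B, x_init=None):
    """Lifted PGD in R^{N+C}; assumes B @ A = I_N.
    Assumed lifting: F(X) = 0.5*||y - (Phi B) X||_2^2, G(X) = ||X||_1."""
    N = Phi.shape[1]
    x_init = np.zeros(N) if x_init is None else x_init
    Phi_lift = Phi @ B                 # M x (N+C) lifted sensing operator
    X = A @ x_init                     # X_0 = A x_init
    for _ in range(K):
        grad = Phi_lift.T @ (Phi_lift @ X - y)         # gradient of F
        X = soft_threshold(X - rho * grad, rho * lam)  # prox of rho*lam*G
    return B @ X                       # recover the N-dimensional estimate
```

With $C=0$ and $\mathbf{A}=\mathbf{B}=\mathbf{I}_N$, this reduces exactly to the original iteration.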
After some fine-tuning, with deep learning used to obtain all the hyper-parameters and settings, I find that this generalization can yield a better result than the original method under the same total iteration number $K$: the mean squared error can be further reduced.
However, the deep-learned parameters of PGD and generalized PGD, obtained from my pre-collected dataset, act like a black box, so they are hard for me to interpret.
(More distracting details are omitted.)
3. My Question
When $C=0$ and $\mathbf{A}=\mathbf{B}=\mathbf{I}_N$, with some trivial settings, the generalization reduces to the original framework.
$(3.1)$ Why is the generalization able to get better results when $C>0$ with some proper settings?
$(3.2)$ Is there a way to show that the generalized PGD can converge (e.g., $\mathbf{B}\hat{\mathbf{X}}_k\rightarrow \bar{\mathbf{x}}$) faster than the original one under some assumptions? I suspect that the high-dimensional PGD can be more powerful, at least by a constant factor.
$(3.3)$ Is the observed phenomenon related to the "momentum" strategy?
4. My Efforts
After a long struggle and much searching online, I still cannot prove the superiority or the faster convergence of the generalized PGD that optimizes in a higher-dimensional feature space.
I cannot even find a correct starting point.
Is there a way to show or analyze this?
Could you please prove or disprove that the generalized PGD can be better than the original PGD?
Update (2022-7-13):
I am working on a proof by mathematical induction. Specifically, $\forall k\in \mathbb{N}$, I want to prove or disprove the following inequality:
$$ \lVert \mathbf{B}\hat{\mathbf{X}}_k -\bar{\mathbf{x}} \rVert^2_2 \le \lVert \hat{\mathbf{x}}_k -\bar{\mathbf{x}} \rVert^2_2. $$
When $k=0$, we have $\mathbf{B}\hat{\mathbf{X}}_0=\mathbf{BA}\hat{\mathbf{x}}_0=\hat{\mathbf{x}}_0$, so the inequality holds with equality.
Then I suppose that the inequality holds for $0\le k\le t$, and I am stuck analyzing the case $k=t+1$.
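Before attempting the induction step, it may help to probe the conjectured inequality numerically. The sketch below is my own construction, not from the post: it runs the plain and lifted iterations side by side and records both sides of the inequality at every $k$, again under the assumed lifting $F(\mathbf{X})=\frac{1}{2}\lVert\mathbf{y}-\boldsymbol{\Phi}\mathbf{B}\mathbf{X}\rVert_2^2$, $G=\lVert\cdot\rVert_1$:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def probe_inequality(Phi, y, x_star, lam, rho, K, A, B):
    """Track ||B X_k - x*||^2 (lhs) vs ||x_k - x*||^2 (rhs), k = 0..K.
    Both chains start from x_0 = 0 and X_0 = A x_0; assumes B @ A = I_N."""
    Phi_lift = Phi @ B
    x = np.zeros(Phi.shape[1])
    X = A @ x
    lhs, rhs = [], []
    for _ in range(K + 1):
        lhs.append(np.sum((B @ X - x_star) ** 2))
        rhs.append(np.sum((x - x_star) ** 2))
        x = soft_threshold(x - rho * (Phi.T @ (Phi @ x - y)), rho * lam)
        X = soft_threshold(X - rho * (Phi_lift.T @ (Phi_lift @ X - y)),
                           rho * lam)
    return lhs, rhs
```

At $k=0$ the two sides coincide, since $\mathbf{B}\hat{\mathbf{X}}_0=\mathbf{BA}\hat{\mathbf{x}}_0=\hat{\mathbf{x}}_0$; whether the inequality survives for $k\ge 1$ will depend on the chosen $\mathbf{A}$, $\mathbf{B}$, and the problem instance, so this is only a sanity check, not evidence either way.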
Is this a correct approach? What assumptions would I need to make it work?