1. Background of Proximal Gradient Descent
I am studying and using Proximal Gradient Descent (PGD) to solve the following vector optimization problem:
$$ \hat{\mathbf{x}}=\underset{{\mathbf{x}}}{\arg\min}~f(\mathbf{x})+\lambda g(\mathbf{x}), $$
where $\hat{\mathbf{x}}\in\mathbb{R}^N$ is the estimate of the target signal $\bar{\mathbf{x}}$, $f:\mathbb{R}^N\rightarrow[0,+\infty)$ is a smooth convex function (e.g., the $\ell_2$ error), $g:\mathbb{R}^N\rightarrow[0,+\infty)$ is a convex function (the regularizer), and $\lambda\in\mathbb{R}^+$ is the regularization parameter.
From [Lecture Note 1], [Lecture Note 2], and [Lecture Note 3], I know that, for a given initialization $\hat{\mathbf{x}}_0$ and suitable step sizes, PGD converges ($\hat{\mathbf{x}}_k\rightarrow \bar{\mathbf{x}}$), with the following iteration:
$$ \hat{\mathbf{x}}_{k+1}=\text{prox}_{\lambda g}(\hat{\mathbf{x}}_{k}-\rho_k\nabla f(\hat{\mathbf{x}}_{k})), $$
where $\rho_k$ is the step size.
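For concreteness, here is a minimal sketch of this iteration for the LASSO case, assuming $f(\mathbf{x})=\frac{1}{2}\lVert\mathbf{y}-\boldsymbol{\Phi}\mathbf{x}\rVert_2^2$ and $g=\lVert\cdot\rVert_1$ (so the prox is soft-thresholding); `Phi` and `y` are hypothetical names for the sensing matrix and the measurements:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pgd_lasso(Phi, y, lam, rho, K, x0=None):
    """Plain PGD (ISTA) for f(x) = 0.5*||y - Phi x||_2^2, g = ||.||_1."""
    x = np.zeros(Phi.shape[1]) if x0 is None else x0.copy()
    for _ in range(K):
        grad = Phi.T @ (Phi @ x - y)                   # gradient of f
        x = soft_threshold(x - rho * grad, rho * lam)  # prox step
    return x
```

A fixed step size $\rho_k \equiv \rho \le 1/\lVert\boldsymbol{\Phi}\rVert_2^2$ suffices for convergence in this sketch.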
2. Background of My Experiments
(I run some experiments on the Compressed Sensing (CS) task based on the LASSO criterion, which aims to linearly subsample a sparse signal and recover it by a nonlinear, optimization-based procedure.)
In my experiments, I find that the plain PGD method is able to find a good solution close to $\bar{\mathbf{x}}$.
Then, I generalize the original optimization problem to a higher-dimensional space:
$$ \hat{\mathbf{X}}=\underset{{\mathbf{X}}}{\arg\min}~F(\mathbf{X})+\lambda G(\mathbf{X}), $$
and solve it using generalized PGD:
$$ \hat{\mathbf{X}}_{k+1}=\text{prox}_{\lambda G}(\hat{\mathbf{X}}_{k}-\rho_k\nabla F(\hat{\mathbf{X}}_{k})), $$
where the iterations are performed in an $(N+C)$-dimensional space, each $\hat{\mathbf{X}}_k\in\mathbb{R}^{N+C}$ is an intermediate result in the feature space of higher dimensionality $(N+C)$, and the generalized $F(\cdot)$ and $G(\cdot)$ take the same forms as the original $f(\cdot)$ and $g(\cdot)$, except that they accept $(N+C)$-dimensional inputs.
Here I project the initialization into the $(N+C)$-dimensional feature space, $\hat{\mathbf{X}}_0=\mathbf{A}\hat{\mathbf{x}}_\text{init}\in\mathbb{R}^{N+C}$ with $\mathbf{A}\in\mathbb{R}^{(N+C)\times N}$, and obtain the final estimate by a linear recovery operator, $\hat{\mathbf{x}}=\mathbf{B}\hat{\mathbf{X}}_K\in\mathbb{R}^{N}$ with $\mathbf{B}\in\mathbb{R}^{N\times(N+C)}$, where $K$ is the total number of iterations. The matrices satisfy $\mathbf{BA}=\mathbf{I}_N$, i.e., $\mathbf{BAx}=\mathbf{x}$ for all $\mathbf{x}$.
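To make the lifted scheme concrete, here is a minimal sketch for the LASSO case. The post does not specify how $F$ is built from $f$, so I assume $F(\mathbf{X})=\frac{1}{2}\lVert\mathbf{y}-\boldsymbol{\Phi}\mathbf{B}\mathbf{X}\rVert_2^2$ (chosen so that $F(\mathbf{A}\mathbf{x})=f(\mathbf{x})$ whenever $\mathbf{BA}=\mathbf{I}_N$) and $G=\lVert\cdot\rVert_1$; `Phi` and `y` are again hypothetical names:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def generalized_pgd(Phi, y, lam, rho, K, A, B, x_init=None):
    """Lifted PGD in R^{N+C}; assumes B @ A = I_N.
    Assumed lifting: F(X) = 0.5*||y - (Phi B) X||_2^2, G(X) = ||X||_1."""
    N = Phi.shape[1]
    x_init = np.zeros(N) if x_init is None else x_init
    Phi_lift = Phi @ B                 # M x (N+C) lifted sensing operator
    X = A @ x_init                     # X_0 = A x_init
    for _ in range(K):
        grad = Phi_lift.T @ (Phi_lift @ X - y)         # gradient of F
        X = soft_threshold(X - rho * grad, rho * lam)  # prox of rho*lam*G
    return B @ X                       # recover the N-dimensional estimate
```

With $C=0$ and $\mathbf{A}=\mathbf{B}=\mathbf{I}_N$, this reduces exactly to the original iteration.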
After some fine-tuning, with deep learning used to obtain all the hyper-parameters and settings, I find that this generalization can yield a better result than the original method under the same total iteration number $K$: the mean squared error can be further reduced.
However, the deep-learned parameters of PGD and generalized PGD, obtained from my pre-collected dataset, act like a black box, so they are hard for me to interpret.
(More distracting details are omitted.)
3. My Question
When $C=0$ and $\mathbf{A}=\mathbf{B}=\mathbf{I}_N$, with some trivial settings, the generalization reduces to the original framework.
$(3.1)$ Why is the generalization able to get better results when $C>0$ with some proper settings?
$(3.2)$ Is there a way to show that the generalized PGD can converge (e.g., $\mathbf{B}\hat{\mathbf{X}}_k\rightarrow \bar{\mathbf{x}}$) faster than the original one under some assumptions? I suspect that the high-dimensional PGD can be more powerful, at least by a constant factor.
$(3.3)$ Is the observed phenomenon related to the "momentum" strategy?
4. My Efforts
After a long struggle and much searching online, I still cannot prove the superiority or the faster convergence of the generalized PGD that optimizes in a higher-dimensional feature space.
I cannot even find a correct starting point.
Is there a way to show or analyze this?
Could you please prove or disprove that the generalized PGD can be better than the original PGD?
Update (2022-7-13):
I am working on a proof by mathematical induction. Specifically, $\forall k\in \mathbb{N}$, I want to prove or disprove the following inequality:
$$ \lVert \mathbf{B}\hat{\mathbf{X}}_k -\bar{\mathbf{x}} \rVert^2_2 \le \lVert \hat{\mathbf{x}}_k -\bar{\mathbf{x}} \rVert^2_2. $$
When $k=0$, we have $\mathbf{B}\hat{\mathbf{X}}_0=\mathbf{BA}\hat{\mathbf{x}}_0=\hat{\mathbf{x}}_0$, so the inequality holds with equality.
Then I suppose that the inequality holds for $0\le k\le t$, and I am stuck analyzing the case $k=t+1$.
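Before attempting the induction step, it may help to probe the conjectured inequality numerically. The sketch below is my own construction, not from the post: it runs the plain and lifted iterations side by side and records both sides of the inequality at every $k$, again under the assumed lifting $F(\mathbf{X})=\frac{1}{2}\lVert\mathbf{y}-\boldsymbol{\Phi}\mathbf{B}\mathbf{X}\rVert_2^2$, $G=\lVert\cdot\rVert_1$:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def probe_inequality(Phi, y, x_star, lam, rho, K, A, B):
    """Track ||B X_k - x*||^2 (lhs) vs ||x_k - x*||^2 (rhs), k = 0..K.
    Both chains start from x_0 = 0 and X_0 = A x_0; assumes B @ A = I_N."""
    Phi_lift = Phi @ B
    x = np.zeros(Phi.shape[1])
    X = A @ x
    lhs, rhs = [], []
    for _ in range(K + 1):
        lhs.append(np.sum((B @ X - x_star) ** 2))
        rhs.append(np.sum((x - x_star) ** 2))
        x = soft_threshold(x - rho * (Phi.T @ (Phi @ x - y)), rho * lam)
        X = soft_threshold(X - rho * (Phi_lift.T @ (Phi_lift @ X - y)),
                           rho * lam)
    return lhs, rhs
```

At $k=0$ the two sides coincide, since $\mathbf{B}\hat{\mathbf{X}}_0=\mathbf{BA}\hat{\mathbf{x}}_0=\hat{\mathbf{x}}_0$; whether the inequality survives for $k\ge 1$ will depend on the chosen $\mathbf{A}$, $\mathbf{B}$, and the problem instance, so this is only a sanity check, not evidence either way.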
Is this a correct approach? What assumptions would I need to make it work?