Intuitive examples satisfying Restricted Strong Convexity


I am reading "A Tight Bound of Hard Thresholding", and on page 13, in Definition 8, we have the following:

A differentiable function $f:\mathbb{R}^d\to \mathbb{R}$ is said to satisfy the property of restricted strong convexity (RSC) with parameter $\alpha_k >0$ if, for all vectors $x$ and $y$ in $\mathbb{R}^d$ with $||x-y||_0 \leq k$, it holds that

$$f(y)\geq f(x)+\nabla f(x)\cdot(y-x)+\frac{\alpha_k}{2}||x-y||^2$$

where $||\cdot||_0$ denotes the number of nonzero entries of its argument.

My understanding is that the above condition says the function is strongly convex along sparse directions, i.e., along directions that change at most $k$ coordinates. I am struggling to come up with good examples (or classes of functions) that are easy to comprehend and justify the intuition behind the definition.

Question: Can you please provide nonlinear examples that are intuitive and satisfy the above property?


There are 2 answers below.


This should be an example for $d=2$: $$ f(x) = (|x_1|+|x_2|)^2. $$ It is not strongly convex: it is constant on the sets $|x_1|+|x_2|=c$, which contain line segments, e.g., the segment from $(c,0)$ to $(0,c)$. It should however satisfy the requirement for $k=1$: take $x\in \mathbb R^2$; we need to check that $t\mapsto f(x+ te_i)$ is strongly convex, uniformly in $x$, for the unit vectors $e_1,e_2$. For $i=1$: $$\begin{split} f(x+te_1)&= (|x_1 + t|+|x_2|)^2 \\&= (x_1 + t)^2 + 2|x_1+t|\,|x_2|+x_2^2\\ &= t^2 + 2t x_1 + x_1^2 + 2|x_1+t|\,|x_2|+x_2^2, \end{split} $$ which is the sum of the strongly convex function $t^2$, the affine function $2tx_1 + x_1^2$, and convex terms, hence strongly convex in $t$ with parameter $2$; the case $i=2$ is symmetric. (Strictly speaking, $f$ is not differentiable on the coordinate axes; there the inequality still holds with any subgradient in place of $\nabla f(x)$.)
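For what it is worth, here is a quick numerical sanity check of the claim above, a minimal Python sketch: the choice $\alpha_1 = 2$, the single-coordinate perturbations, and the decision to skip points on the axes (where $f$ is not differentiable) all follow the computation just given; the sampling ranges and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return (np.abs(x[0]) + np.abs(x[1])) ** 2

def grad_f(x):
    # Gradient of f, valid off the coordinate axes where f is differentiable.
    return 2 * (np.abs(x[0]) + np.abs(x[1])) * np.sign(x)

alpha = 2.0  # strong-convexity parameter along coordinate directions, per the computation above

for _ in range(10_000):
    x = rng.uniform(-5, 5, size=2)
    if np.any(np.abs(x) < 1e-6):
        continue  # stay off the axes so grad_f(x) is well defined
    i = rng.integers(2)          # perturb one coordinate, so ||x - y||_0 <= 1
    t = rng.uniform(-5, 5)
    y = x.copy()
    y[i] += t
    lhs = f(y)
    rhs = f(x) + grad_f(x) @ (y - x) + 0.5 * alpha * np.sum((x - y) ** 2)
    assert lhs >= rhs - 1e-9, (x, y)

print("RSC inequality with k = 1, alpha_1 = 2 held on all sampled pairs.")
```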


Let me give you another example that arises in statistical learning applications with sparsity constraints. There the function $f$ corresponds to the empirical risk with squared loss. More precisely, suppose that we observe $n$ input-output pairs $(x_{i}, y_{i}) \in \mathbb{R}^{d} \times \mathbb{R}$ sampled i.i.d. according to the model \begin{align*} x_{i} &\sim P_{X}, \text{ where }P_{X}\text{ denotes the marginal distribution of }X,\\ y_{i} &= \langle w^{\star}, x_{i} \rangle + \xi_{i}, \text{ where $w^{\star} \in \mathbb{R}^{d}$ is some $k$-sparse vector and $\xi_{i} \sim N(0, \sigma^{2})$.} \end{align*} The aim is to construct an estimate $\widehat{w} = \widehat{w}((x_{1},y_{1}), \dots, (x_{n}, y_{n})) \in \mathbb{R}^{d}$ such that $\|\widehat{w} - w^{\star}\|_{2}^{2}$ is small. The setting of interest is $k \ll n \ll d$.
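This data-generating process is easy to simulate; here is a minimal sketch (the particular values of $n$, $d$, $k$, $\sigma$ are arbitrary, chosen only to respect $k \ll n \ll d$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 50, 500, 5, 0.1   # k << n << d

# Design matrix with i.i.d. N(0, 1) entries, playing the role of P_X.
X = rng.standard_normal((n, d))

# k-sparse ground truth w_star.
w_star = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
w_star[support] = rng.standard_normal(k)

# Responses y_i = <w_star, x_i> + xi_i with Gaussian noise.
y = X @ w_star + sigma * rng.standard_normal(n)
```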

Now let us define $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^{n}$ by letting the $i$-th row of $X$ equal $x_{i}^{\mathsf{T}}$ and the $i$-th element of $y$ equal $y_{i}$. We can then define the empirical risk function by $$ f(w) = \frac{1}{2n}\|Xw - y\|_{2}^{2}. $$ Observe that $f$ is not strongly convex, since its Hessian equals $\frac{1}{n}X^{\mathsf{T}}X$, which has rank at most $n \ll d$. At the same time, the condition $n \ll d$ implies that (whenever $X$ has full row rank) infinitely many $w \in \mathbb{R}^{d}$ attain $f(w) = 0$, so minimizing $f$ alone is not enough to approximately recover the true parameter $w^{\star}$.
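Continuing the sketch above, both facts can be verified directly: the Hessian is rank deficient, and any null-space direction of $X$ is a flat direction of $f$.

```python
H = X.T @ X / n                       # Hessian of f, constant in w
print(np.linalg.matrix_rank(H))       # at most n = 50, far below d = 500

# A right-singular vector with zero singular value lies in the null space
# of X, so moving along it leaves f unchanged: f has flat directions.
_, _, Vt = np.linalg.svd(X)           # Vt is d x d; rows beyond n span null(X)
v = Vt[-1]
f = lambda w: 0.5 * np.linalg.norm(X @ w - y) ** 2 / n
w = rng.standard_normal(d)
print(abs(f(w + v) - f(w)))           # ~0 up to floating-point error
```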

Here is how restricted strong convexity kicks in. We can define our estimator (completely ignoring computational issues) as any solution of the optimization problem $$ \widehat{w} \in \mathrm{argmin}_{w \in \mathbb{R}^{d},\; \|w\|_{0} \leq k} f(w). $$ The output vector is $k$-sparse (i.e., $\|\widehat{w}\|_{0} \leq k$), and in particular $\|\widehat{w} - w^{\star}\|_{0} \leq 2k$. The restricted strong convexity assumption (applied at sparsity level $2k$) ensures that $f$ has enough curvature in the direction $w^{\star} - \widehat{w}$ that a small value of $f(w^{\star}) - f(\widehat{w})$ translates into a small squared error $\|\widehat{w} - w^{\star}\|_{2}^{2}$.
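In practice one cannot solve this constrained problem exactly; a standard heuristic is iterative hard thresholding (projected gradient descent onto the set $\{w : \|w\|_{0} \leq k\}$), which is the flavor of algorithm studied in the paper cited in the question. Below is a minimal sketch reusing the synthetic data from the snippets above; the step size `eta` and iteration count are arbitrary choices, not tuned or guaranteed values.

```python
def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w, zero out the rest."""
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]
    out[idx] = w[idx]
    return out

def iht(X, y, k, eta=0.5, iters=500):
    """Iterative hard thresholding for f(w) = ||Xw - y||^2 / (2n):
    a gradient step on f followed by projection onto k-sparse vectors."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = hard_threshold(w - eta * grad, k)
    return w

w_hat = iht(X, y, k)
print(np.linalg.norm(w_hat - w_star) ** 2)   # small when RSC holds
```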


When is restricted strong convexity satisfied naturally in the above example? To finally answer @Sepide's question: in our example, the restricted strong convexity condition for vectors $w_{1}, w_{2}$ with $\|w_{1} - w_{2}\|_{0} \leq k$ reads as \begin{align*} &f(w_{2})\geq f(w_{1})+\nabla f(w_{1})\cdot(w_{2}-w_{1})+\frac{\alpha_k}{2}\|w_{1}-w_{2}\|_{2}^{2}\\ \iff& \frac{1}{2n}\|Xw_{2} - y\|_{2}^{2} \geq \frac{1}{2n}\|Xw_{1} - y\|_{2}^{2} + \left\langle\frac{1}{n}X^{\mathsf{T}}(Xw_{1} - y), w_{2} - w_{1}\right\rangle + \frac{\alpha_{k}}{2}\|w_{1} - w_{2}\|_{2}^{2} \\ \iff& \frac{1}{2n}\|Xw_{2} - Xw_{1} + Xw_{1} - y\|_{2}^{2} \geq \frac{1}{2n}\|Xw_{1} - y\|_{2}^{2} + \left\langle\frac{1}{n}X^{\mathsf{T}}(Xw_{1} - y), w_{2} - w_{1}\right\rangle + \frac{\alpha_{k}}{2}\|w_{1} - w_{2}\|_{2}^{2} \\ \iff& \frac{1}{2n}\|X(w_{2} - w_{1})\|_{2}^{2} \geq \frac{\alpha_{k}}{2}\|w_{1} - w_{2}\|_{2}^{2} \\ \iff& \frac{1}{n}\|X(w_{1} - w_{2})\|_{2}^{2} \geq \alpha_{k} \|w_{1} - w_{2}\|_{2}^{2}, \end{align*} where the third equivalence follows by expanding the square on the left and cancelling the cross term $\frac{1}{n}\langle X(w_{2} - w_{1}), Xw_{1} - y\rangle$ against the inner product on the right.

Thus, in our setting, the restricted strong convexity condition reduces to a restricted eigenvalue condition on $\frac{1}{n}X^{\mathsf{T}}X$. Let $X_{S} \in \mathbb{R}^{n \times |S|}$ denote the matrix constructed by taking the columns of $X$ indexed by a support set $S \subseteq \{1, \dots, d\}$. Then another way to state the restricted eigenvalue condition is $$ \frac{1}{n}X^{\mathsf{T}}_{S}X_{S} \succcurlyeq \alpha_{k}I \quad\text{for every } S \text{ with } |S| \leq k. $$

The above condition can be ensured, for example, by sampling the entries of $X$ i.i.d. from the $N(0,1)$ distribution. Then, by standard results from random matrix theory, for any fixed $S$ we have $\|\frac{1}{n}X^{\mathsf{T}}_{S}X_{S} - I\| \leq c\sqrt{\frac{|S|}{n}}$ with high probability, where $c$ is an absolute constant and $\|\cdot\|$ denotes the operator norm. By a union bound over all $\binom{d}{k}$ supports, taking $n \in \Theta(k \log(ed/k))$ samples is enough to ensure restricted strong convexity with $\alpha_{k} \approx 1$ (in fact, this ensures the stronger restricted isometry property). The setting discussed in this post is natural, for instance, in compressed sensing applications, where you have full control over the marginal distribution $P_{X}$.
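For a concrete feel, here is a small empirical spot check of the restricted eigenvalue condition, again reusing the synthetic `X`, `n`, `d`, `k` from the snippets above. It samples random supports rather than enumerating all $\binom{d}{k}$ of them, so it provides evidence rather than a certificate.

```python
# Smallest eigenvalue of (1/n) X_S^T X_S over randomly sampled supports S
# with |S| = k; values near 1 are consistent with alpha_k ~ 1.
alphas = []
for _ in range(1000):
    S = rng.choice(d, size=k, replace=False)
    XS = X[:, S]
    alphas.append(np.linalg.eigvalsh(XS.T @ XS / n).min())

print(min(alphas))
```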