Let $g(\theta;x)$ and $f(\theta;x)$ be two convex functions such that $g$ asymptotically approximates $f$: $g(\theta;x)\approx f(\theta;x)$, specifically:
$$ |g(\theta;x)-f(\theta;x)| \leq \mathcal{O}\left( \frac{1}{\sqrt{\|\theta\odot x\|_0}}\right) $$
As $\|\theta\odot x\|_0$ (the number of indices at which both $\theta$ and $x$ are nonzero) increases, the approximation becomes tighter; at the same time, due to the nature of $f$, evaluating $f(\theta;x)$ becomes more expensive.
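For concreteness, the quantity $\|\theta\odot x\|_0$ is just a nonzero-overlap count; a minimal numpy sketch (the function name `overlap_nnz` is my own, not standard):

```python
import numpy as np

def overlap_nnz(theta, x):
    """Compute ||theta ⊙ x||_0: the number of indices
    at which both theta and x are nonzero."""
    return int(np.count_nonzero(theta * x))

theta = np.array([1.0, 0.0, 2.0, 0.0])
x     = np.array([3.0, 4.0, 5.0, 0.0])
overlap_nnz(theta, x)  # indices 0 and 2 overlap -> 2
```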
Our ideal goal is to:
$$\underset{\theta}{\text{minimize}} \sum_i f(\theta;x_i)$$
For dense $x_i\odot \theta$ (i.e., when the number of zeros in $x_i$ is not large, and $\theta$ is guaranteed to have few or no zeros), you can safely solve the following optimization program instead:
$$\underset{\theta}{\text{minimize}} \sum_i g(\theta;x_i)$$
In reality, some of the $x_i$'s can be quite sparse, or $\theta$ itself can become sparse during the optimization. For such $x_i$ and $\theta$, you can cheaply evaluate $f(\theta;x_i)$ directly instead of approximating it by $g$.
The problem is that, theoretically, you cannot switch between optimization programs while running the optimization algorithm (correct me if I'm wrong). Otherwise, we could do the following:
$$ h(\theta; x,T)=I(\|\theta\odot x\|_0\leq T)f(\theta; x)+I(\|\theta\odot x\|_0> T)g(\theta; x) $$
where $T$ is a predetermined threshold value, and $I(\cdot)$ is the indicator function: $I(\text{True})=1$, $I(\text{False})=0$.
And, minimize $h$:
$$\underset{\theta}{\text{minimize}} \sum_i h(\theta;x_i,T)$$
This can be done by choosing the objective function at each iteration of the optimization algorithm (e.g. gradient descent).
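The per-iteration switching can be sketched as follows; here `f`, `g`, and their gradients are hypothetical toy stand-ins (a quadratic and a lightly regularized quadratic), chosen only so the code runs — substitute the real functions:

```python
import numpy as np

# Hypothetical stand-ins: f is the "exact" objective, g its cheap surrogate.
def f(theta, x):      return 0.5 * np.sum((theta - x) ** 2)
def grad_f(theta, x): return theta - x
def g(theta, x):      return 0.5 * np.sum((theta - x) ** 2) + 0.01 * np.sum(np.abs(theta))
def grad_g(theta, x): return (theta - x) + 0.01 * np.sign(theta)

def grad_h(theta, x, T):
    """(Sub)gradient of h(theta; x, T): use f when the nonzero
    overlap ||theta ⊙ x||_0 is at most T, otherwise the surrogate g."""
    if np.count_nonzero(theta * x) <= T:
        return grad_f(theta, x)
    return grad_g(theta, x)

def sgd(data, theta0, T, lr=0.1, epochs=50):
    """Plain SGD that re-selects the objective at every step."""
    theta = theta0.copy()
    for _ in range(epochs):
        for x in data:
            theta -= lr * grad_h(theta, x, T)
    return theta
```

Note that `grad_g` uses a subgradient of the absolute value at zero, and that the objective evaluated by each step can change between iterations as $\theta$'s sparsity pattern changes, which is exactly where the convexity of the combined $h$ breaks down.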
In this formulation, $h$ is no longer convex.
My questions are:
- What is the best solution to this problem?
- What kind of convergence guarantees can we have? (This is specifically important for me, you can assume gradient descent, SGD, LBFGS, or some other optimization algorithm for the analysis.)
- Is there anything fundamentally wrong with my suggested solution?
- What can we do if $g$ is convex in each dimension but can be non-convex in general?
Thanks for any help!