Optimization of approximate functions using varying objective function


Let $g(\theta;x)$ and $f(\theta;x)$ be two convex functions such that $g$ asymptotically approximates $f$, i.e. $g(\theta;x)\approx f(\theta;x)$; specifically:

$$ |g(\theta;x)-f(\theta;x)| \leq \mathcal{O}\left( \frac{1}{\sqrt{\|\theta\odot x\|_0}}\right) $$

As $\|\theta\odot x\|_0$ (the number of indices at which both $\theta$ and $x$ are nonzero) increases, the approximation becomes tighter; at the same time, due to the nature of $f$, evaluating $f(\theta;x)$ becomes harder.
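For concreteness, the quantity $\|\theta\odot x\|_0$ can be computed as follows (a minimal numpy sketch; `overlap_count` is a hypothetical helper name):

```python
import numpy as np

def overlap_count(theta, x):
    """||theta ⊙ x||_0: number of indices where both theta and x are nonzero."""
    return int(np.count_nonzero(theta * x))

theta = np.array([1.0, 0.0, 2.0, 3.0])
x     = np.array([0.5, 4.0, 0.0, 1.0])
overlap_count(theta, x)  # → 2 (indices 0 and 3 are nonzero in both)
```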

Our ideal goal is to:

$$\underset{\theta}{\text{minimize}} \sum_i f(\theta;x_i)$$

For dense $x_i\odot \theta$ (i.e. when the number of zeros in $x_i$ is not large and $\theta$ is guaranteed to have few or no zeros), you can safely solve the following optimization program instead:

$$\underset{\theta}{\text{minimize}} \sum_i g(\theta;x_i)$$

In reality, some of the $x_i$'s can be quite sparse, or $\theta$ itself can become sparse during the optimization. For such $x_i$'s and $\theta$, you can cheaply evaluate $f(\theta;x_i)$ directly instead of approximating it by $g$.

The problem is that, in theory, you cannot switch between optimization programs while running the optimization algorithm (correct me if I'm wrong). Otherwise, we could do the following:

$$ h(\theta; x,T)=I(\|\theta\odot x\|_0\leq T)f(\theta; x)+I(\|\theta\odot x\|_0> T)g(\theta; x) $$

where $T$ is a predetermined threshold value, and $I(True)=1, I(False)=0$.

And, minimize $h$:

$$\underset{\theta}{\text{minimize}} \sum_i h(\theta;x_i,T)$$

This can be done by choosing the objective function at each iteration of the optimization algorithm (e.g. gradient descent).
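The per-iteration switching could be sketched like this (all names are hypothetical; `f_grad` and `g_grad` are placeholder gradients, and the toy example uses a least-squares term for both, whereas in practice $g$ would only approximate $f$):

```python
import numpy as np

def sgd_with_switching(f_grad, g_grad, X, theta0, T, lr=0.01, epochs=50):
    """Minimize sum_i h(theta; x_i, T) by stochastic (sub)gradient descent,
    choosing at each step the gradient of f (exact, cheap when the overlap
    is small) or of g (the approximation) based on ||theta ⊙ x_i||_0."""
    theta = theta0.copy()
    for _ in range(epochs):
        for x in X:
            overlap = np.count_nonzero(theta * x)
            grad = f_grad(theta, x) if overlap <= T else g_grad(theta, x)
            theta -= lr * grad
    return theta

# Toy data: minimize sum_i (x_i . theta - 1)^2.
f_grad = lambda th, x: 2 * x * (x @ th - 1.0)
g_grad = f_grad  # placeholder: in practice g only approximates f
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = sgd_with_switching(f_grad, g_grad, X, np.zeros(2), T=1)
```

Note that because the choice between `f_grad` and `g_grad` depends on the current iterate $\theta$, the effective objective can change between iterations, which is exactly the source of the theoretical difficulty raised above.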

In this formulation, $h$ is no longer convex.

My questions are:

  1. What is the best solution to this problem?
  2. What kind of convergence guarantees can we have? (This is especially important to me; you can assume gradient descent, SGD, L-BFGS, or some other optimization algorithm for the analysis.)
  3. Is there anything fundamentally wrong with my suggested solution?
  4. What can we do if $g$ is convex in each dimension but can be non-convex in general?

Thanks for any help!