Is argmin attained equivalent to local convexity?


I have the following optimization problem:

$$ \theta^* \in \arg \min_\theta f(\theta) $$ $$ \widehat{\theta^*} \in \arg \min_\theta \widehat{f}(\theta) $$

with $f(\theta) = \mathbb{E}(g(X; \theta))$ ($X$ is a random variable) and $\widehat{f}$ its empirical counterpart, $\widehat{f}(\theta) = \frac{1}{M}\sum_{m=1}^{M} g(X^m;\theta)$ (the $X^m$ are i.i.d.). I only assume $g$ is continuous (typically it is a neural network).

I know that for fixed $\theta$, $f(\theta)-\widehat{f}(\theta)=\mathcal{O}_\mathbb{P}(1/\sqrt{M})$ (by the CLT).
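For intuition, here is a minimal Monte Carlo sketch of that pointwise $\mathcal{O}_\mathbb{P}(1/\sqrt{M})$ rate. The loss $g(x,\theta)=(x-\theta)^2$ with $X\sim\mathcal{N}(0,1)$ is an illustrative stand-in (so $f$ has the closed form $\theta^2+1$), not the neural-network loss from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5

def g(x, theta):
    # stand-in loss: squared distance to a fixed parameter
    return (x - theta) ** 2

def f(theta):
    # closed form of E[g(X, theta)] for X ~ N(0, 1)
    return theta ** 2 + 1.0

def f_hat(theta, M):
    # empirical counterpart on a fresh i.i.d. sample of size M
    X = rng.standard_normal(M)
    return g(X, theta).mean()

# sqrt(M) * std(f_hat - f) stays roughly constant as M grows,
# which is exactly the O_P(1/sqrt(M)) fluctuation scale
for M in (100, 400, 1600):
    errs = [f_hat(theta, M) - f(theta) for _ in range(2000)]
    print(M, np.sqrt(M) * np.std(errs))
```

The three printed values agree up to Monte Carlo noise, confirming that the pointwise error shrinks like $1/\sqrt{M}$.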

I can also prove results like $\sup_\theta |f(\theta)-\widehat{f}(\theta)| \to 0$ as $M \to +\infty$, using properties of my neural networks $g(\cdot\,;\theta)$.

Now my question is: if I assume that the $\arg \min$ is attained for both $\theta^*$ and $\widehat{\theta^*}$, can I obtain a result that "transfers" the convergence $\widehat{f} \rightarrow f$ to $\widehat{\theta^*} \rightarrow \theta^*$, such as

$$ \theta^*-\widehat{\theta^*}=\mathcal{O}_\mathbb{P}(1/\sqrt{M}) \quad (1) $$

Moreover, coming back to the title of the question: if I assume that the $\arg \min$ is attained (say, by stochastic gradient descent from a good initialization $\theta_0$), is that equivalent to saying that $f$ and $\widehat{f}$ are strictly convex in a neighbourhood of $\theta_0$? If so, that would make $(1)$ easier to prove, but how?

Thank you

1 Answer

I'm not 100% sure that I understand everything correctly, but let me give you an example that might answer your question.

Let us take, for $\theta = (\theta_1,\theta_2)\in [0,1]^2$, $$ g(x, \theta) = | \theta_1 \max\{x - \theta_2,0\}|^2. $$ So $g$ is the loss of a ReLU neural network trying to approximate the zero function. Choosing $(X^m)_{m=1}^M$ to be i.i.d. in $[0,1]$, we have that $\theta^* := (1, 1)$ is a local minimum of $\hat{f}$ (it is even global), since $\hat{f}(\theta^*)= 0$. This holds independently of $M$.

In addition, $\theta = (0, 0)$ is a global minimum of $f$ since $f(\theta) = 0$.

Since $\theta^*\neq \theta$, this shows that the answer to your question is no in general. What makes this example work is that $f$ and $\hat{f}$ do not necessarily have unique minimizers; their $\arg\min$ sets can contain several points, so minimizers of $\hat{f}$ need not approach any particular minimizer of $f$.
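A quick numerical check of this counterexample (a sketch; the uniform sample and the value of $M$ are arbitrary choices on my part):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x, theta):
    # loss of a one-ReLU network fitting the zero function
    theta1, theta2 = theta
    return (theta1 * np.maximum(x - theta2, 0.0)) ** 2

M = 1000
X = rng.uniform(0.0, 1.0, M)  # i.i.d. sample in [0, 1)

def f_hat(theta):
    return g(X, theta).mean()

# Both parameter choices achieve zero empirical loss, for every sample and M:
print(f_hat((1.0, 1.0)))  # -> 0.0 (the ReLU is inactive on [0, 1))
print(f_hat((0.0, 0.0)))  # -> 0.0 (the output weight is zero)
```

Since both points minimize $\hat{f}$ exactly whatever the sample, nothing forces the optimizer's output to converge to any one minimizer of $f$ as $M\to\infty$.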