Stability of 'min' versus 'argmin' in nested optimization problems


While reading paper [1], I came across a comment about the numerical stability of 'min' and 'argmin' operations in optimization which I don't understand. For background, the paper gives a variational characterization of two functions $\overline{Q}_X$ and $Q_X$ (I don't think the details of them are important):

$$\overline{Q}_X(p) = \min_x f(x, p), \qquad Q_X(p) = \operatorname{argmin}_x f(x, p).$$

These formulas are useful for incorporating $\overline{Q}_X$ and $Q_X$ into optimization problems, since minimization over $p$ can be done through joint minimization with an auxiliary variable $x$. However, the author makes a comment, which I paraphrase: "As the formulas above underscore for anyone familiar with the relative behavior of 'min' and 'argmin' in numerical optimization, $Q_X$ is inherently less stable than $\overline{Q}_X$."

Can anyone elaborate on this comment? Evidently I am not familiar with the relative behavior of 'min' and 'argmin'. Is he saying that 'min' operations are more stable than 'argmin' operations? Why? I can see that using 'min' rather than 'argmin' is easier to model, but I'm not sure I see the relevance of stability.

[1] Rockafellar, R. Tyrrell, and Johannes O. Royset. "Random variables, monotone relations, and convex analysis." Mathematical Programming 148.1-2 (2014): 297-331.



BEST ANSWER

As an extremely simple example, consider

$$ \operatorname{argmin}_x \left((x-1)^2 + a\right)(x+1)^2 $$

as $a$ varies from a small positive number to a small negative number.

Plotting the function for two values of $a$ shows what happens (the original answer linked a Desmos graph and two screenshots):

[Graph for $a = +0.05$: a "W"-shaped curve]

[Graph for $a = -0.05$: a slightly different "W"-shaped curve]

You can see how the min moves continuously from zero to small negative values, but the argmin jumps from $x = -1$ to $x \approx +1$.

BTW, this is just an example, but one can show that if the quantity you are minimizing is a continuous function of its parameters (and, say, the minimization is over a compact set), then its minimum value is also a continuous function of those parameters. This example shows that the same cannot be proved for the argmin.
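The jump is easy to reproduce numerically. Here is a minimal sketch using a brute-force grid search (the grid resolution and the intermediate values of $a$ are illustrative choices, not from the answer): as $a$ crosses zero, the printed min changes smoothly while the argmin jumps from $-1$ to roughly $+1$.

```python
import numpy as np

# The example from the answer: f(x) = ((x-1)^2 + a) (x+1)^2.
def f(x, a):
    return ((x - 1)**2 + a) * (x + 1)**2

# Fine grid; a = +/-0.05 matches the two graphs above.
x = np.linspace(-2, 2, 400001)
for a in (+0.05, +0.01, -0.01, -0.05):
    i = np.argmin(f(x, a))
    print(f"a={a:+.2f}   min={f(x[i], a):+.4f}   argmin={x[i]:+.3f}")
```

For $a > 0$ both factors are nonnegative, so the global minimum is $0$ at $x = -1$; the moment $a$ turns negative, a slightly lower value appears near $x = +1$ and the argmin jumps there, while the min itself moves only by $O(a)$.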

SECOND ANSWER

If there are no restrictions on the function $f$, then the statement made about the "stability" of the two functions is false.

Here is a counterexample. Let $f : \mathbb{R}^2 \rightarrow \mathbb{R}$ denote the function given by $$f(x,y) = x^2 + y^2.$$ Let $g : \mathbb{R} \rightarrow \mathbb{R}$ be given by $$g(y) = \underset{x \in \mathbb{R}}{\text{argmin}}f(x,y) $$ and let $$ h(y) = \underset{x \in \mathbb{R}}{\min} f(x,y). $$ We observe that $g$ is a well-defined function, as there is only a single $x$ which minimizes $x \mapsto f(x,y)$, and $$ \forall y \in \mathbb{R} \: : \quad g(y) = 0.$$ Moreover, we have $$ \forall y \in \mathbb{R} \: : \quad h(y) = y^2.$$ It is clear that the function $h$ is more sensitive to changes in the input argument $y$ than the constant function $g$.
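A few lines of code confirm this numerically (the grid minimization is just an illustrative stand-in for an exact inner solver):

```python
import numpy as np

# f(x, y) = x^2 + y^2, minimized over x for each fixed y.
x = np.linspace(-3, 3, 6001)
for y in (0.0, 0.5, 1.0):
    vals = x**2 + y**2
    g = x[np.argmin(vals)]   # argmin over x: always 0, independent of y
    h = vals.min()           # min over x: equals y^2, varies with y
    print(f"y={y}:  g(y)={g:+.3f}   h(y)={h:.3f}")
```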

It is true that software which minimizes the value of a smooth function $f$ will tend to compute the value of the minimum $y=f(r)$ much more accurately than the location $x=r$ of the minimum. This is a consequence of Taylor's theorem.

Specifically, if $f :\mathbb{R} \rightarrow \mathbb{R}$ has a global minimum at $x = r$, then $f'(r) = 0$ and by Taylor's theorem we have $$ f(x) - f(r) = f'(r)(x-r) + \frac{1}{2}f''(r)(x-r)^2 + O((x-r)^3).$$ It follows that $$f(x) - f(r) \approx \frac{1}{2}f''(r)(x-r)^2.$$ Ideally, the software terminates and returns $x$ such that $$|f(x)-f(r)| \approx\delta,$$ where $\delta$ is some tolerance acceptable to the user. In this case we have $$|x-r| = O(\sqrt{\delta}).$$

We conclude that $f(r)$ is computed more accurately than $r$.
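A small sketch makes the $O(\sqrt{\delta})$ scaling concrete (the quadratic $f(x) = (x-2)^2$ is an illustrative choice, not from the papers above):

```python
import math

# Illustrative choice: f(x) = (x - 2)^2, with minimizer r = 2,
# minimum value f(r) = 0, and f''(r) = 2.
f = lambda x: (x - 2.0) ** 2
r = 2.0

# Every x within about sqrt(2*delta / f''(r)) of r already satisfies
# |f(x) - f(r)| <= delta, so a stopping rule based on function values
# cannot pin x down better than O(sqrt(delta)).
for delta in (1e-4, 1e-8, 1e-12):
    x = r + math.sqrt(delta)          # then f(x) - f(r) = delta (up to rounding)
    print(f"delta={delta:.0e}   |f(x)-f(r)|={abs(f(x) - f(r)):.1e}   |x-r|={abs(x - r):.1e}")
```

With $\delta = 10^{-12}$, the function value is accurate to twelve digits while the location is only accurate to about six.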

Of course this does not preclude the possibility of computing $r$ accurately. In particular, we can try to solve the equation $$f'(x) = 0$$ with respect to $x$. The set of solutions will certainly include $r$.
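As a sketch of this idea, here is Newton's method applied to $f'(x) = 0$ for the illustrative choice $f(x) = \cos(x)$, whose minimizer on $(0, 2\pi)$ is $r = \pi$; root-finding recovers $r$ to near machine precision, unlike a stopping rule based on function values:

```python
import math

# Solve f'(x) = 0 by Newton's method for f(x) = cos(x).
fp  = lambda x: -math.sin(x)   # f'(x)
fpp = lambda x: -math.cos(x)   # f''(x)

x = 3.0                        # starting guess near the minimizer
for _ in range(20):
    x -= fp(x) / fpp(x)        # Newton step on f'(x) = 0
print(x)                       # close to pi = 3.141592653589793
```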

The real question which must be investigated is the conditioning of the two functions $$p \mapsto Q_X(p) \quad\text{and}\quad p \mapsto \bar{Q}_X(p).$$

The terms "conditioning" and "stability" are often used as if they are equivalent. They are not.

Conditioning refers to the mathematical problem and has no relation to the algorithm or machine being used to compute approximations. Stability refers to the quality of the algorithm being used to solve the problem and the relevant expressions typically involve constants native to the machine such as the unit roundoff.

The conditioning of a problem measures the sensitivity of the solution to small changes in the parameters that define the problem. The conditioning of a problem is quantified using condition numbers. There are different types of condition numbers: absolute, relative, componentwise, normwise and structured condition numbers. The choice of condition number depends on the exact situation, but all condition numbers are nonnegative.

If $ f : \mathbb{R} \rightarrow \mathbb{R}$ is differentiable, then the absolute condition number of $f$ at the point $x$ is merely $$\kappa_f^{\text{abs}}(x) = |f'(x)| \ge 0.$$
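A quick sketch of this definition (the function $f(x) = x^2$ is an illustrative choice): the observed sensitivity of $f$ to a small input perturbation matches $|f'(x)|$.

```python
# Absolute condition number kappa(x) = |f'(x)| for f(x) = x^2,
# compared with the observed finite-difference sensitivity.
f  = lambda x: x * x
df = lambda x: 2.0 * x

eps = 1e-6
for x0 in (0.1, 1.0, 100.0):
    observed = abs(f(x0 + eps) - f(x0)) / eps   # |f(x+eps) - f(x)| / eps
    print(f"x={x0}:  |f'(x)|={abs(df(x0)):.2f}   observed={observed:.2f}")
```

The same input perturbation of size $10^{-6}$ moves the output by about $0.2\times10^{-6}$ at $x = 0.1$ but by about $200\times10^{-6}$ at $x = 100$, exactly as the condition number predicts.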

Functions which have a large condition number are said to be ill-conditioned, while functions with a small condition number are said to be well-conditioned.

Constant functions are well-conditioned and have the smallest possible condition number, i.e., zero.