A dual norm optimization problem


I'm reading this machine learning optimization paper https://arxiv.org/pdf/2010.01412.pdf. At the last formula of page 3, they derived an optimization problem like this:

$\hat{\epsilon}(\mathbf{w}) = \underset{\|\epsilon\|_p \leq \rho}{\operatorname{argmax}}\ \epsilon^{\mathbf{T}}\nabla_{\mathbf{w}}L_S(\mathbf{w})$ (1)

They said this is a classical dual norm problem and the solution is:

$\hat{\epsilon}(\mathbf{w}) = \rho\,\operatorname{sign}(\nabla_{\mathbf{w}}L_S(\mathbf{w}))\,|\nabla_{\mathbf{w}}L_S(\mathbf{w})|^{q-1}\big/\big(\|\nabla_{\mathbf{w}}L_S(\mathbf{w})\|_q^q\big)^{1/p}$ (2)

with $\frac{1}{p}+\frac{1}{q} = 1$, where $|\cdot|^{q-1}$ denotes elementwise absolute value and power.
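For concreteness, equation $(2)$ is straightforward to evaluate numerically. Below is a minimal NumPy sketch; the names `grad`, `rho`, and `p` are my own (the paper writes the gradient as $\nabla_{\mathbf{w}}L_S(\mathbf{w})$), and the two sanity checks at the end are the properties the answer below establishes algebraically.

```python
import numpy as np

def eps_hat(grad, rho, p):
    """Closed-form maximizer of eq. (1), per eq. (2) of the paper (sketch)."""
    q = p / (p - 1)  # dual exponent, so that 1/p + 1/q = 1
    num = rho * np.sign(grad) * np.abs(grad) ** (q - 1)
    den = np.sum(np.abs(grad) ** q) ** (1.0 / p)  # (||grad||_q^q)^(1/p)
    return num / den

rng = np.random.default_rng(0)
g = rng.normal(size=5)
rho, p = 0.05, 2.0
q = p / (p - 1)
e = eps_hat(g, rho, p)

# Sanity checks: the maximizer sits on the boundary of the p-ball,
# and it attains the Holder bound rho * ||g||_q.
print(np.linalg.norm(e, p))                          # ≈ rho
print(e @ g, rho * np.linalg.norm(g, q))             # the two values agree
```

For $p = 2$ the formula reduces to the familiar $\hat{\epsilon} = \rho\, g/\|g\|_2$, i.e. a step of length $\rho$ in the gradient direction.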

Can anyone please show me how to solve the optimization problem to arrive at the second formula? I'd really appreciate it.

Best answer:

First off, to reduce unnecessary clutter, let
$$ x=\nabla_{\mathbf{w}}L_S(\mathbf{w})\qquad\text{and}\qquad \hat{\epsilon}=\hat{\epsilon}(\mathbf{w})\ . $$
Then equation $(2)$ becomes
$$ \hat{\epsilon}=\frac{\rho\,\operatorname{sign}(x)\,|x|^{q-1}}{\|x\|_q^{q/p}}\ . $$
Immediately before their equation $(2)$, the authors of your cited paper note that "$|\cdot|^{q-1}$ denotes elementwise absolute value and power". Although they don't say so explicitly, the same elementwise interpretation must be applied to $\operatorname{sign}(\cdot)$, so the equation can be written componentwise as
$$ \hat{\epsilon}_i=\frac{\rho\,\operatorname{sign}(x_i)\,|x_i|^{q-1}}{\|x\|_q^{q/p}}\ . $$
With $\hat{\epsilon}$ thus defined, a little algebraic manipulation, making liberal use of the identity $p+q=pq$ (equivalent to $\frac{1}{p}+\frac{1}{q}=1$), gives
$$ \|\hat{\epsilon}\|_p=\rho \qquad\text{and}\qquad \hat{\epsilon}^{\mathbf{T}}x=\rho\,\|x\|_q\ . $$
But Hölder's inequality tells us that
$$ \epsilon^{\mathbf{T}}x\le\|\epsilon\|_p\,\|x\|_q\le \rho\,\|x\|_q $$
whenever $\|\epsilon\|_p\le\rho\ $. Thus, on the closed ball $\big\{\,\epsilon\,\big|\,\|\epsilon\|_p\le\rho\big\}$, the linear function $\epsilon^{\mathbf{T}}x$ of $\epsilon$ is bounded above by $\rho\,\|x\|_q$, and achieves that bound at $\epsilon=\hat{\epsilon}$. It follows that
$$ \hat{\epsilon}=\underset{\|\epsilon\|_p\le\rho}{\operatorname{argmax}}\ \epsilon^{\mathbf{T}}x\ . $$
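For completeness, the "little algebraic manipulation" can be spelled out. From $\frac{1}{p}+\frac{1}{q}=1$ one gets $p(q-1)=q$ and $q\left(1-\frac{1}{p}\right)=1$, and substituting $\hat{\epsilon}_i = \rho\,\operatorname{sign}(x_i)\,|x_i|^{q-1}/\|x\|_q^{q/p}$ directly:

```latex
\|\hat{\epsilon}\|_p^p
  = \frac{\rho^p}{\|x\|_q^{q}}\sum_i |x_i|^{p(q-1)}
  = \frac{\rho^p}{\|x\|_q^{q}}\sum_i |x_i|^{q}
  = \rho^p,
\qquad
\hat{\epsilon}^{\mathbf{T}}x
  = \frac{\rho}{\|x\|_q^{q/p}}\sum_i |x_i|^{q}
  = \rho\,\|x\|_q^{\,q-q/p}
  = \rho\,\|x\|_q .
```

Here the denominator of the first expression uses $\big(\|x\|_q^{q/p}\big)^p=\|x\|_q^{q}$, and the final step uses $q-\frac{q}{p}=q\left(1-\frac{1}{p}\right)=1$.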