Dear optimization experts,
My apologies for asking about what is probably a well-known relation between Huber-loss-based optimization and $\ell_1$-based optimization. However, I am stuck on a 'first-principles' proof (one that does not invoke the Moreau envelope, e.g., here) showing that they are equivalent.
Problem formulation
The observation vector is \begin{align*} \mathbf{y} &= \mathbf{A}\mathbf{x} + \mathbf{z} + \mathbf{\epsilon} \\ \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} &= \begin{bmatrix} \mathbf{a}_1^T\mathbf{x} + z_1 + \epsilon_1 \\ \vdots \\ \mathbf{a}_N^T\mathbf{x} + z_N + \epsilon_N \end{bmatrix} \end{align*} where
- $\mathbf{A} = \begin{bmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_N^T \end{bmatrix} \in \mathbb{R}^{N \times M}$ is a known matrix
- $\mathbf{x} \in \mathbb{R}^{M \times 1}$ is an unknown vector
- $\mathbf{z} = \begin{bmatrix} z_1 \\ \vdots \\ z_N \end{bmatrix} \in \mathbb{R}^{N \times 1}$ is also unknown but sparse in nature, e.g., it can be seen as a vector of outliers
- $\mathbf{\epsilon} \in \mathbb{R}^{N \times 1}$ is measurement noise, say with i.i.d. standard Gaussian entries of zero mean and unit variance, i.e., $\mathcal{N}(0,1)$.
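For concreteness, here is a small synthetic instantiation of this observation model. This is purely illustrative; the dimensions, outlier pattern, and all variable names (`A`, `x_true`, `z`, `eps`, `y`) are my own choices, not part of the question.

```python
import numpy as np

# Synthetic instance of y = A x + z + eps (illustrative sizes and values)
rng = np.random.default_rng(42)
N, M = 30, 5
A = rng.normal(size=(N, M))          # known matrix A
x_true = rng.normal(size=M)          # unknown vector x
z = np.zeros(N)                      # sparse outlier vector z
idx = rng.choice(N, size=3, replace=False)
z[idx] = 10.0 * rng.normal(size=3)   # a few large outliers
eps = rng.normal(size=N)             # N(0,1) measurement noise
y = A @ x_true + z + eps             # observations
```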
We need to prove that the following two optimization problems P$1$ and P$2$ are equivalent.
P$1$: \begin{align*} \text{minimize}_{\mathbf{x},\mathbf{z}} \quad & \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \end{align*}
and
P$2$: \begin{align*} \text{minimize}_{\mathbf{x}} \quad & \sum_{i=1}^{N} \mathcal{H} \left( y_i - \mathbf{a}_i^T\mathbf{x} \right), \end{align*} where the Huber function $\mathcal{H}(u)$ is given by $$\mathcal{H}(u) = \begin{cases} |u|^2 & |u| \leq \frac{\lambda}{2} \\ \lambda |u| - \frac{\lambda^2}{4} & |u| > \frac{\lambda}{2} \end{cases} . $$
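Before attempting a proof, one can sanity-check the claimed equivalence numerically for a single coordinate. The sketch below (my own illustration; `lam`, the grid bounds, and the test points are arbitrary choices) brute-forces the inner minimization of P$1$ over $z$ and compares the optimal value with $\mathcal{H}$:

```python
import numpy as np

# For scalar u, compare min_z (u - z)^2 + lam*|z| (brute-force grid
# search) with the Huber function H(u) defined above.
lam = 1.5

def huber(u, lam):
    # H(u) = u^2 for |u| <= lam/2, and lam*|u| - lam^2/4 otherwise
    return np.where(np.abs(u) <= lam / 2, u**2, lam * np.abs(u) - lam**2 / 4)

z_grid = np.linspace(-10.0, 10.0, 200001)  # dense grid of candidate z values
for u in np.linspace(-3.0, 3.0, 13):
    inner = (u - z_grid) ** 2 + lam * np.abs(z_grid)  # one coordinate of P1
    assert abs(inner.min() - huber(u, lam)) < 1e-3
print("min_z (u - z)^2 + lam*|z| agrees with H(u) on the test points")
```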
My partial attempt following the suggestion in the answer below
We convert problem P$1$ into an equivalent form by plugging in the optimal solution for $\mathbf{z}$, i.e.,
\begin{align*} & \text{minimize}_{\mathbf{x},\mathbf{z}} \quad \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \\ \equiv \ & \text{minimize}_{\mathbf{x}} \left\{ \text{minimize}_{\mathbf{z}} \quad \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \right\} \end{align*}
Setting the subgradient with respect to $\mathbf{z}$ to zero, \begin{align} 0 & \in \frac{\partial}{\partial \mathbf{z}} \left( \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \right) \\ \Leftrightarrow \quad 0 & \in -2 \left( \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \right) + \lambda \, \partial \lVert \mathbf{z} \rVert_1 \\ \Leftrightarrow \quad \left( \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \right) & = \frac{\lambda}{2} \mathbf{v} \end{align} for some $ \mathbf{v} \in \partial \lVert \mathbf{z} \rVert_1 $ following Ryan Tibshirani's lecture notes (slides 18-20), i.e., \begin{align} v_i \in \begin{cases} \{1\} & \text{if } z_i > 0 \\ \{-1\} & \text{if } z_i < 0 \\ [-1,1] & \text{if } z_i = 0 \\ \end{cases}. \end{align} Note the factor $2$ from differentiating the squared residual: it halves the effective threshold to $\lambda/2$ (in Tibshirani's notes the quadratic term carries a prefactor $\frac{1}{2}$, which is why the threshold there is $\lambda$). Then, the subgradient optimality reads: \begin{align} \begin{cases} y_i - \mathbf{a}_i^T\mathbf{x} - z_i = \frac{\lambda}{2} \, {\rm sign}\left(z_i\right) & \text{if } z_i \neq 0 \\ \left| y_i - \mathbf{a}_i^T\mathbf{x} \right| \leq \frac{\lambda}{2} & \text{if } z_i = 0 \end{cases} \end{align} Solving coordinatewise, the solution is 'soft thresholding' at level $\lambda/2$, $$\mathbf{z}^\star = S_{\lambda/2}\left( \mathbf{y} - \mathbf{A}\mathbf{x} \right),$$ where \begin{align} S_{\lambda/2}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right) = \begin{cases} y_i - \mathbf{a}_i^T\mathbf{x} - \frac{\lambda}{2} & \text{if } y_i - \mathbf{a}_i^T\mathbf{x} > \frac{\lambda}{2} \\ 0 & \text{if } -\frac{\lambda}{2} \leq y_i - \mathbf{a}_i^T\mathbf{x} \leq \frac{\lambda}{2} \\ y_i - \mathbf{a}_i^T\mathbf{x} + \frac{\lambda}{2} & \text{if } y_i - \mathbf{a}_i^T\mathbf{x} < -\frac{\lambda}{2} \\ \end{cases} . \end{align}
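As a sanity check on this step (an illustrative sketch, not part of the proof; variable names and constants are my own), the following compares the soft-thresholding formula at level $\lambda/2$ against a brute-force grid search on the scalar problem:

```python
import numpy as np

# z* = S_{lam/2}(r) should minimize (r - z)^2 + lam*|z| in each coordinate.
lam = 1.5

def soft_threshold(r, t):
    # S_t(r) = sign(r) * max(|r| - t, 0)
    return np.sign(r) * np.maximum(np.abs(r) - t, 0.0)

rng = np.random.default_rng(0)
r = 2.0 * rng.normal(size=200)           # stand-ins for residuals y_i - a_i^T x
z_star = soft_threshold(r, lam / 2)      # claimed coordinatewise minimizer
z_grid = np.linspace(-8.0, 8.0, 160001)  # brute-force search grid
for ri, zi in zip(r, z_star):
    grid_min = ((ri - z_grid) ** 2 + lam * np.abs(z_grid)).min()
    assert (ri - zi) ** 2 + lam * abs(zi) <= grid_min + 1e-9
print("soft thresholding at lam/2 attains the coordinatewise minimum")
```

Replacing `lam / 2` with `lam` in the call above makes the assertion fail, which is a quick way to see that the threshold really is $\lambda/2$ and not $\lambda$.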
Now, plugging $\mathbf{z}^\star$ back in, we turn the optimization problem P$1$ into \begin{align*} & \text{minimize}_{\mathbf{x}} \left\{ \text{minimize}_{\mathbf{z}} \quad \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z} \rVert_2^2 + \lambda\lVert \mathbf{z} \rVert_1 \right\} \\ \equiv \ & \text{minimize}_{\mathbf{x}} \quad \lVert \mathbf{y} - \mathbf{A}\mathbf{x} - S_{\lambda/2}\left( \mathbf{y} - \mathbf{A}\mathbf{x} \right) \rVert_2^2 + \lambda\lVert S_{\lambda/2}\left( \mathbf{y} - \mathbf{A}\mathbf{x} \right) \rVert_1 . \end{align*} We now check the two regimes of $S_{\lambda/2}$ coordinatewise.
- if $\lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert \leq \frac{\lambda}{2}$, then $S_{\lambda/2}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right) = 0$.
For these coordinates the objective contributes $$\lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert^2,$$ which matches the Huber penalty $\mathcal{H}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right)$ in this regime. Agree?
- if $\lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert > \frac{\lambda}{2}$, then $S_{\lambda/2}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right) = y_i - \mathbf{a}_i^T\mathbf{x} \mp \frac{\lambda}{2}$, and the residual becomes $y_i - \mathbf{a}_i^T\mathbf{x} - S_{\lambda/2}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right) = \pm\frac{\lambda}{2}$.
For these coordinates the objective contributes $$\frac{\lambda^2}{4} + \lambda \left\lvert y_i - \mathbf{a}_i^T\mathbf{x} \mp \frac{\lambda}{2} \right\rvert = \frac{\lambda^2}{4} + \lambda \left( \lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert - \frac{\lambda}{2} \right) = \lambda \lvert y_i - \mathbf{a}_i^T\mathbf{x} \rvert - \frac{\lambda^2}{4}.$$ The sign $\mp\frac{\lambda}{2}$ is resolved by noting that $S_{\lambda/2}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right)$ has the same sign as $y_i - \mathbf{a}_i^T\mathbf{x}$, so subtracting $\frac{\lambda}{2}\,{\rm sign}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right)$ shrinks the magnitude by exactly $\frac{\lambda}{2}$. This is the Huber penalty in this regime, which completes the equivalence; a numerical cross-check follows.
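Putting both cases together, the identity $\lVert \mathbf{y} - \mathbf{A}\mathbf{x} - \mathbf{z}^\star \rVert_2^2 + \lambda \lVert \mathbf{z}^\star \rVert_1 = \sum_i \mathcal{H}\left( y_i - \mathbf{a}_i^T\mathbf{x} \right)$ holds for every $\mathbf{x}$, not just the minimizer. A quick cross-check on random data (sizes, seed, and names are my own illustrative choices):

```python
import numpy as np

# For ANY x, P1's objective at z = S_{lam/2}(y - A x) equals P2's
# Huber objective, so the two problems share the same minimizers in x.
lam = 1.5
rng = np.random.default_rng(1)
N, M = 30, 5
A = rng.normal(size=(N, M))
y = rng.normal(size=N)

def soft_threshold(r, t):
    return np.sign(r) * np.maximum(np.abs(r) - t, 0.0)

def huber(u, lam):
    return np.where(np.abs(u) <= lam / 2, u**2, lam * np.abs(u) - lam**2 / 4)

for _ in range(100):
    x = rng.normal(size=M)
    r = y - A @ x                    # residual vector
    z = soft_threshold(r, lam / 2)   # inner minimizer of P1
    p1 = np.sum((r - z) ** 2) + lam * np.sum(np.abs(z))
    p2 = np.sum(huber(r, lam))
    assert abs(p1 - p2) < 1e-9
print("P1 (with the optimal z plugged in) equals P2 at every tested x")
```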
The idea is much simpler. Use the fact that $$\min_{\mathbf{x}, \mathbf{z}} f(\mathbf{x}, \mathbf{z}) = \min_{\mathbf{x}} \left\{ \min_{\mathbf{z}} f(\mathbf{x}, \mathbf{z}) \right\}.$$ In your case, the optimal value of the inner minimization problem is exactly the Huber function of the residual.
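To spell this out for a single coordinate: writing $u = y_i - \mathbf{a}_i^T\mathbf{x}$ for the residual, the inner minimum is \begin{align*} \min_{z \in \mathbb{R}} \left\{ (u - z)^2 + \lambda \lvert z \rvert \right\} &= \begin{cases} u^2 & \text{if } \lvert u \rvert \leq \frac{\lambda}{2} \quad \left( z^\star = 0 \right) \\ \frac{\lambda^2}{4} + \lambda \left( \lvert u \rvert - \frac{\lambda}{2} \right) = \lambda \lvert u \rvert - \frac{\lambda^2}{4} & \text{if } \lvert u \rvert > \frac{\lambda}{2} \quad \left( z^\star = u - \frac{\lambda}{2} \, {\rm sign}(u) \right) \end{cases} \\ &= \mathcal{H}(u), \end{align*} and summing over $i = 1, \dots, N$ turns the inner minimum of P$1$ into exactly the objective of P$2$.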