Algorithm for a minimization problem of a function that is not differentiable under a constraint


I'm trying to understand an algorithm for a minimization problem, but it is unclear to me. Here is the function we consider:

$\lVert Y - X\beta\rVert_{2}^{2} + \lambda\lVert\beta\rVert_{1}$, where $Y\in\mathbb{R}^D$, $\beta\in\mathbb{R}^d$, $X\in\mathbb{R}^{D\times d}$ and $\lambda>0$ (note that $X\beta\in\mathbb{R}^D$, so $Y$ must lie in $\mathbb{R}^D$ as well).
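For concreteness, here is how I evaluate this objective on a small made-up instance (the dimensions, data, and value of $\lambda$ are my own, not from the algorithm I'm asking about):

```python
import numpy as np

# Made-up small instance: D = 5 observations, d = 3 coefficients.
rng = np.random.default_rng(0)
D, d = 5, 3
X = rng.normal(size=(D, d))   # X in R^{D x d}
Y = rng.normal(size=D)        # Y in R^D, since X @ beta lands in R^D
lam = 0.1                     # lambda, chosen arbitrarily

def lasso_objective(beta, X, Y, lam):
    """||Y - X beta||_2^2 + lam * ||beta||_1."""
    residual = Y - X @ beta
    return residual @ residual + lam * np.abs(beta).sum()

# At beta = 0 the penalty vanishes, so the objective is just ||Y||_2^2.
print(lasso_objective(np.zeros(d), X, Y, lam))
```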

This function is not differentiable everywhere, since the $\ell_1$ norm is a sum of absolute values. I have found the following algorithm that tries to minimize this function, and it is somewhat unclear to me:

  1. Set $\beta=0$.
  2. Let $g(\beta)=\lVert Y - X\beta\rVert_{2}^{2}$, so that $\nabla g(\beta)=-2X^{\top}(Y-X\beta)$. At an unconstrained minimum of $g$ we would have $\nabla g(\beta)=0$, so we try to drive the gradient toward $0$ step by step, as follows.
  3. Take the coordinate of $\nabla g(\beta)$ that is greatest in absolute value, say $\lvert\nabla g(\beta)_i\rvert$, and move $\beta_i$ until $\lvert\nabla g(\beta)_i\rvert$ equals the second-greatest value, say $\lvert\nabla g(\beta)_j\rvert$.
  4. Then move $\beta_i$ and $\beta_j$ together until $\lvert\nabla g(\beta)_i\rvert$ and $\lvert\nabla g(\beta)_j\rvert$ reach $\lvert\nabla g(\beta)_k\rvert$, and repeat this up to $d$ times.
  5. We then obtain a vector $\beta$ that "minimizes" $\nabla g(\beta)$, and moreover we see that the coordinate with the largest $\lvert\nabla g(\beta)_i\rvert$ is the one that contributes the most to the minimization.
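To check my reading of step 3, I tried sketching one such move numerically. Along a single coordinate direction each gradient coordinate changes affinely, so the point where the two largest magnitudes tie can be solved in closed form. The data, seed, and dimensions below are made up by me, not taken from the algorithm:

```python
import numpy as np

# Made-up instance (not from the question): D = 6 observations, d = 4 coefficients.
rng = np.random.default_rng(1)
D, d = 6, 4
X = rng.normal(size=(D, d))
Y = rng.normal(size=D)

def grad_g(beta):
    """Gradient of g(beta) = ||Y - X beta||_2^2, i.e. -2 X^T (Y - X beta)."""
    return -2.0 * X.T @ (Y - X @ beta)

beta = np.zeros(d)                      # step 1: start from beta = 0
g = grad_g(beta)
order = np.argsort(-np.abs(g))
i, j = order[0], order[1]               # largest and second-largest |coordinate|

# Moving beta along e_i changes every gradient coordinate affinely:
#   grad_k(beta + t e_i) = g_k + 2 t <x_k, x_i>.
# Solve for the t at which |grad_i| and |grad_j| tie (two sign cases),
# and take the nearest tie point.
ci = X[:, i] @ X[:, i]
cij = X[:, j] @ X[:, i]
candidates = [(g[j] - g[i]) / (2.0 * (ci - cij)),
              -(g[i] + g[j]) / (2.0 * (ci + cij))]
t = min(candidates, key=abs)
beta[i] += t

g_new = grad_g(beta)
print(abs(g_new[i]), abs(g_new[j]))     # the two magnitudes now coincide
```

If this reading is right, the procedure looks to me like the homotopy/LARS idea, where coordinates enter one at a time as their gradient magnitudes tie.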

Here are my questions: why do we set $\beta=0$? I thought it was in order to take the first-order condition on $\lVert Y - X\beta\rVert_{2}^{2}$, but this is not the case.

This leads me to my second question: why not just take the derivative of $\lVert Y - X\beta\rVert_{2}^{2}$ with respect to $\beta$ and set it to zero, since this expression is differentiable?
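To make this second question concrete, I tried the following experiment on made-up data (the instance and the value of $\lambda$ are my own): setting the derivative of the smooth part to zero gives the least-squares solution, but shrinking that solution slightly toward $0$ already lowers the full penalized objective, which I suspect is why solving $\nabla g(\beta)=0$ alone cannot be enough.

```python
import numpy as np

# Made-up instance; lambda chosen large so the effect is visible.
rng = np.random.default_rng(2)
D, d = 8, 3
X = rng.normal(size=(D, d))
Y = rng.normal(size=D)
lam = 5.0

# Setting the derivative of ||Y - X beta||_2^2 to zero gives the
# least-squares solution, which does not depend on lambda at all.
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)

def objective(beta):
    r = Y - X @ beta
    return r @ r + lam * np.abs(beta).sum()

# Shrinking beta_ls slightly toward 0: the quadratic term grows only to
# second order while the l1 term drops to first order, so the penalized
# objective decreases.
print(objective(0.99 * beta_ls), objective(beta_ls))
```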

Thank you very much for your help!