What does L1 regularization for multiclass discriminative classification look like?


In general, an L1-regularized objective takes the form $\mathcal{L}(w,X) + \lambda R(w)$, where $\mathcal{L}$ is a loss function, $w$ is a vector of weights, $X$ is the data matrix, and $\lambda R(w)$ is the regularization term with regularization parameter $\lambda$. For L1 regularization, I've always seen $R(w) = \lVert w \rVert_1$.

In the case of multiclass discriminative classification (say multiclass logistic regression for example) with $k$ classes, $W$ is now a weight matrix which has either $k$ columns or $k$ rows depending on convention. In this case, is L1 regularization still given by $R(W) = ||W||_1$, where $||W||_1$ is the matrix L1 norm (max column sum)? Or is it a linear combination of the L1 norms of each column (or row)? In other words, is the regularization term given by $$\lambda ||W||_1$$ or $$\lambda_1||w_1||_1 + \lambda_2||w_2||_1 + \lambda_3||w_3||_1,$$where $w_i$ is the $i$th column of $W$?
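To make the distinction concrete, here is a small numpy sketch (the matrix values are made up for illustration) showing that the induced matrix 1-norm (max absolute column sum) and the entrywise L1 norm (sum of all absolute entries) are genuinely different quantities:

```python
import numpy as np

# Hypothetical 2x3 weight matrix (2 features, 3 classes, columns = classes)
W = np.array([[1.0, -2.0,  0.5],
              [3.0,  1.0, -0.5]])

# Induced matrix 1-norm: maximum absolute column sum
induced_l1 = np.abs(W).sum(axis=0).max()   # columns sum to 4.0, 3.0, 1.0 -> 4.0

# Entrywise L1 norm: sum of absolute values of all entries
entrywise_l1 = np.abs(W).sum()             # 8.0

print(induced_l1, entrywise_l1)
```

Note that the entrywise version equals the sum of the per-column L1 norms, so with a single shared $\lambda$ the second candidate form in the question collapses to the first (entrywise) interpretation.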

For reference, I am trying to determine what the loss function at the bottom of page 15 on the paper linked below actually looks like.

https://www.cs.ubc.ca/cgi-bin/tr/2009/TR-2009-19.pdf

Accepted answer:

It is to be understood entrywise: $$ \lVert W \rVert_1 = \sum_{i,j} \lvert W_{ij} \rvert, $$ i.e. you collect all the weights into a single vector and take its $L_1$ norm, with one shared $\lambda$ (the same convention applies to $L_2$).
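As a minimal numpy sketch of how this penalty enters the multiclass logistic loss (function names and shapes are my own choices, assuming columns of $W$ index classes):

```python
import numpy as np

def l1_penalty(W, lam):
    # Entrywise L1: lambda * sum |W_ij| over all entries, per the answer above
    return lam * np.abs(W).sum()

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def regularized_loss(W, X, y, lam):
    """Multiclass logistic (cross-entropy) loss plus entrywise L1 penalty.

    W: (n_features, k) weight matrix
    X: (n_samples, n_features) data matrix
    y: integer class labels in 0..k-1
    """
    P = softmax(X @ W)
    nll = -np.log(P[np.arange(len(y)), y]).sum()
    return nll + l1_penalty(W, lam)
```

Sanity check: with $W = 0$ and $\lambda = 0$, every class gets probability $1/k$, so the loss is $n \log k$.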

For reference, see e.g. sections 7.1 and 7.1.2 of the book "Deep Learning" by Goodfellow et al.