Should the 2 in L-2 norm notation be a subscript or a superscript?


I'm currently taking an online course where the professor has a habit of writing norms like this:

$||a^{[l](C)} - a^{[l](G)} ||^2$

Since I don't have a great amount of experience in math or the concepts of deep learning, I was often confused about whether the 2 simply meant, in conjunction with the double bars, "apply the L2 norm to the terms within, i.e. square each of them and then sum the result," or whether the 2 was itself a squaring of whatever the double bars meant on their own.

So I googled for confirmation of the notation, and it seems that the 2 is normally written as a subscript, not a superscript. For example, that is how the notation is written on Wikipedia: https://en.wikipedia.org/wiki/Norm_(mathematics)#Notation

So is the superscript notation wrong? Or is this just one of those unfortunate cases where there is no standard?


There are 2 answers below.

BEST ANSWER

> Since I don't have a great amount of experience in math or the concepts of deep learning, I was often confused about whether the 2 simply meant, in conjunction with the double bars, "apply the L2 norm to the terms within, i.e. square each of them and then sum the result," or whether the 2 was itself a squaring of whatever the double bars meant on their own.

I have never seen an author disambiguate the norm delimiters $\lVert\quad\rVert$ through the use of a superscript. In analysis, such notation would be incredibly confusing, since we frequently need to establish inequalities among norms of vectors raised to some power.

Also, an $L^2$ norm of a vector is the square root of the sum of the absolute squares of its components: $$\lVert x\rVert_2=\sqrt{\sum_{i=1}^n\lvert x_i\rvert^2}\text{;}$$ consequently, $$\lVert x\rVert^2_2=\sum_{i=1}^n\lvert x_i\rvert^2\text{.}$$
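
To make the distinction concrete, here is a minimal NumPy sketch (my own illustration, not from the linked course or Wikipedia) showing the norm versus its square:

```python
import numpy as np

# A concrete vector: ||x||_2 = sqrt(3^2 + 4^2) = 5
x = np.array([3.0, 4.0])

l2_norm = np.linalg.norm(x, ord=2)   # the L2 norm itself: 5.0
l2_norm_squared = l2_norm ** 2       # the squared L2 norm: 25.0

print(l2_norm)                                      # 5.0
print(l2_norm_squared)                              # 25.0
print(np.isclose(l2_norm_squared, np.sum(x ** 2)))  # True: sum of squared components
```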


Just to add to this: since the OP mentioned "concepts of deep learning", I'm guessing that the expressions containing these norms appear in loss functions. Regression loss functions usually include a square because it gives nicer derivatives for gradient descent.

For example, with a regularisation term added, you would see something like $$ L = \frac{1}{2}\sum_{i=1}^N (y_i - f(\vec x_i))^2 + \frac{\lambda}{2}||\vec w||^2 $$ with $f(\vec x) = \vec w \cdot \vec x$. Lots of squares here, but differentiating with respect to any $w_j$, we get $$ \frac{\partial L}{\partial w_j} = \sum_{i=1}^N (f(\vec x_i) - y_i)\, x_{ij} + \lambda w_j, $$ where $x_{ij}$ is the $j$-th component of $\vec x_i$. This would have been a lot less nice without the squares. Other norms are sometimes also used (e.g. $||\vec w||_1$ for L1-regularisation), but the convention in machine learning is that $||\vec w||$ means $||\vec w||_2$ by default.
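
A rough NumPy sketch of the same computation (the data and function names here are made up purely for illustration, not taken from any course or library), including a numerical check of the gradient:

```python
import numpy as np

def loss(w, X, y, lam):
    """L = 1/2 * sum_i (y_i - w.x_i)^2 + lam/2 * ||w||_2^2"""
    residuals = y - X @ w
    return 0.5 * np.sum(residuals ** 2) + 0.5 * lam * np.dot(w, w)

def grad(w, X, y, lam):
    """dL/dw_j = sum_i (w.x_i - y_i) * x_ij + lam * w_j"""
    return X.T @ (X @ w - y) + lam * w

# Tiny made-up data, purely to sanity-check the gradient numerically.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=3)
lam = 0.1

# Central-difference approximation of dL/dw_j for each coordinate direction.
eps = 1e-6
numerical = np.array([
    (loss(w + eps * e, X, y, lam) - loss(w - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(numerical, grad(w, X, y, lam), atol=1e-6))  # True
```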