How is L2 regularization derived?


I just proved to myself why the regularization term is added to, rather than multiplied with, the loss function.

I did so by taking the MLE formula...

$$\arg\max_{\Theta}\sum_{i}\log(P(x_{i}|\Theta))$$

and since we know that MAP uses a prior belief distribution...

$$P(\Theta | x) = \frac{P(x|\Theta )P(\Theta )}{P(x)}$$

We can write MAP as...

$$\arg\max_{\Theta}\,\log\Big(P(\Theta)\prod_{i}P(x_{i}|\Theta)\Big)$$

If we expand the log of the product, we can see that $\log(P(\Theta))$ is the regularization term, as shown below...

$$\sum_{i}\log(P(x_{i}|\Theta)) + \log(P(\Theta))$$

Now I would like to show how the L2 penalty itself is derived. L2 regularization is defined as...

$$\lambda \sum_{k}\sum_{l}W^{2}_{k,l}$$

which is $\lambda$ times the sum of the squared weights (not an element-wise product). Where does this equation come from? What choices of $P(x_{i}|\Theta)$ and $P(\Theta)$, for example, do I need to make to derive this L2 formula? Can someone please explain it to me step by step?

Best answer:

Take the prior $\mathbb{P}(\Theta)$ to be multivariate normal $\mathcal{N}_l(\mathbf{0},I)$, where $l$ is the dimension of the parameter $\Theta$ and $I$ is the $l\times l$ identity matrix.
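Spelling this out (a sketch, using the slightly more general prior $\mathcal{N}(\mathbf{0},\sigma^{2}I)$ so that the constant $\lambda$ appears explicitly; $d$ denotes the total number of weights):

```latex
% log-density of the Gaussian prior over the flattened weights W
\log P(\Theta)
  = \log\!\left[(2\pi\sigma^{2})^{-d/2}
      \exp\!\Big(-\tfrac{1}{2\sigma^{2}}\sum_{k}\sum_{l}W_{k,l}^{2}\Big)\right]
  = -\tfrac{d}{2}\log(2\pi\sigma^{2})
    - \tfrac{1}{2\sigma^{2}}\sum_{k}\sum_{l}W_{k,l}^{2}
```

The first term does not depend on $\Theta$, so it drops out of the $\arg\max$. Maximizing $\sum_{i}\log(P(x_{i}|\Theta)) + \log(P(\Theta))$ is therefore the same as minimizing $-\sum_{i}\log(P(x_{i}|\Theta)) + \lambda\sum_{k}\sum_{l}W_{k,l}^{2}$ with $\lambda = \frac{1}{2\sigma^{2}}$. The standard-normal prior ($\sigma = 1$) gives $\lambda = \frac{1}{2}$, and a tighter prior (smaller $\sigma$) means a larger $\lambda$, i.e. stronger regularization.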

Here is a reference
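As a quick numerical sanity check (hypothetical code, not part of the answer), the negative log of a standard-normal prior differs from the $\lambda = \frac{1}{2}$ penalty only by a constant that is independent of the weights:

```python
import math

# Sketch: log-density of the prior theta ~ N(0, I) over the flattened
# weight vector, i.e. independent standard normals per weight.
def log_prior(weights):
    d = len(weights)
    return -0.5 * d * math.log(2 * math.pi) - 0.5 * sum(w * w for w in weights)

weights = [0.5, -1.0, 2.0]

# L2 penalty with lambda = 1/2, matching 1/(2 sigma^2) at sigma = 1
l2_penalty = 0.5 * sum(w * w for w in weights)

# Additive constant from the Gaussian normalizer, independent of the weights
norm_const = 0.5 * len(weights) * math.log(2 * math.pi)

# Minimizing -log P(theta) is therefore the same as minimizing the L2 penalty
assert math.isclose(-log_prior(weights), l2_penalty + norm_const)
```

Since the constant vanishes under the $\arg\max$, adding $-\log P(\Theta)$ to the negative log-likelihood is exactly adding an L2 penalty.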