The following is from the paper *The Modern Mathematics of Deep Learning*, Remark 1.3:
Furthermore, one can consider density estimation tasks, where $\mathcal{X}=\mathcal{Z}$, $\mathcal{Y}:=[0, \infty]$, and $\mathcal{F}$ consists of probability densities with respect to some $\sigma$-finite reference measure $\mu$ on $\mathcal{Z}$. One then aims for a probability density $f_s$ that approximates the density of the unseen data $z$ with respect to $\mu$. One can perform $L^2(\mu)$-approximation based on the discretization $\mathcal{L}(f, z)=-2 f(z)+\|f\|_{L^{2}(\mu)}^{2}$ or maximum likelihood estimation based on the surprisal $\mathcal{L}(f, z)=-\log (f(z))$.
I don't understand the notation $\|f\|_{L^{2}(\mu)}^{2}$. Does $L^2(\mu)$ mean the $L^2$-norm of the function $\mu$? How is $\mathcal{L}(f, z)=-2 f(z)+\|f\|_{L^{2}(\mu)}^{2}$ derived?
In this context, $L^2(\mu)$ is the $L^2$-space with respect to the measure $\mu$. In other words, for a function $f$ one formally defines its $L^2(\mu)$ norm as $$ \|f\|_{L^2(\mu)}^2 : = \int |f(x)|^2 \, d\mu(x), $$ and one then defines $L^2(\mu)$ as the space of functions with finite $L^2(\mu)$ norm. This space also carries a scalar product $\langle f, g \rangle_{L^2(\mu)} = \int f g \, d\mu = \mathbb{E}(f(G))$, where the last equality holds when $g$ is a probability density with respect to $\mu$ and $G$ is a random variable distributed according to the density $g$.
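These quantities are easy to check numerically. Below is a minimal sketch, assuming $\mathcal{Z}=[0,1]$, $\mu$ the Lebesgue measure, and two hypothetical densities $f(x)=3x^2$ and $g(x)=2x$ (none of these choices come from the paper; they are just for illustration). It computes $\|f\|_{L^2(\mu)}^2$ and $\langle f, g\rangle_{L^2(\mu)}$ by a Riemann sum, and confirms that the scalar product equals $\mathbb{E}(f(G))$ by Monte Carlo sampling.

```python
import math
import random

# Hypothetical setting: Z = [0, 1], mu = Lebesgue measure.
f = lambda x: 3 * x**2   # a probability density on [0, 1]
g = lambda x: 2 * x      # another probability density on [0, 1]

# ||f||_{L^2(mu)}^2 = int f(x)^2 dmu(x), via a midpoint Riemann sum.
n = 100_000
xs = [(i + 0.5) / n for i in range(n)]
norm_f_sq = sum(f(x) ** 2 for x in xs) / n      # exact value: 9/5
inner_fg = sum(f(x) * g(x) for x in xs) / n     # exact value: 3/2

# <f, g>_{L^2(mu)} = E[f(G)] when G has density g.  For g(x) = 2x the
# CDF is x^2, so inverse-CDF sampling gives G = sqrt(U), U ~ Uniform(0,1).
random.seed(0)
m = 200_000
mc_estimate = sum(f(math.sqrt(random.random())) for _ in range(m)) / m

print(norm_f_sq, inner_fg, mc_estimate)  # ≈ 1.8, 1.5, 1.5
```

The Monte Carlo estimate agrees with the Riemann-sum value of the scalar product, which is exactly the identity $\langle f, g\rangle_{L^2(\mu)} = \mathbb{E}(f(G))$ used in the answer.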
The reason for the definition of $\mathcal{L}(f, z) = -2f(z) + \|f\|_{L^2(\mu)}^2$ is then the following: We have a density $h$ which describes the generation of the data $z$ and we would like to find a density $f$ such that $\|f - h\|_{L^2(\mu)}$ is minimal.
Expanding the squared norm (using the bilinearity of the scalar product), we can rewrite the expression as follows: $$ \|f - h\|_{L^2(\mu)}^2 = \|f\|^2_{L^2(\mu)} - 2\langle f, h\rangle_{L^2(\mu)} + \|h\|^2_{L^2(\mu)}. $$ The last term does not depend on $f$ and can hence be ignored in the minimisation. We get that $$ \min_{f} \|f - h\|_{L^2(\mu)}^2 = \min_{f} \|f\|^2_{L^2(\mu)} - 2\langle f, h\rangle_{L^2(\mu)} = \min_{f} \|f\|^2_{L^2(\mu)} - 2\mathbb{E}(f(H)), $$ where $\mathbb{E}$ is the expected value and $H$ is a random variable that is distributed according to the density $h$.
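The expansion can be sanity-checked numerically. The sketch below assumes the same hypothetical setting as before ($\mathcal{Z}=[0,1]$, $\mu$ Lebesgue, illustrative densities $f(x)=3x^2$ and $h(x)=2x$) and verifies that $\|f-h\|^2$ agrees with $\|f\|^2 - 2\langle f,h\rangle + \|h\|^2$.

```python
# Hypothetical densities on Z = [0, 1] with mu = Lebesgue measure.
f = lambda x: 3 * x**2
h = lambda x: 2 * x

# Midpoint Riemann sum as a stand-in for integration against mu.
n = 100_000
xs = [(i + 0.5) / n for i in range(n)]
integrate = lambda phi: sum(phi(x) for x in xs) / n

# Left-hand side: ||f - h||^2.
lhs = integrate(lambda x: (f(x) - h(x)) ** 2)

# Right-hand side: ||f||^2 - 2<f, h> + ||h||^2.
rhs = (integrate(lambda x: f(x) ** 2)
       - 2 * integrate(lambda x: f(x) * h(x))
       + integrate(lambda x: h(x) ** 2))

print(lhs, rhs)  # both ≈ 2/15 ≈ 0.1333
```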
Now having i.i.d. random variables $(z_i)_{i=1}^m$ distributed like $H$, we obtain that $$ \frac{1}{m}\sum_{i=1}^m \mathcal{L}(f,z_i) = \frac{1}{m}\sum_{i=1}^m \left(\|f\|^2_{L^2(\mu)} - 2f(z_i) \right)\to \|f\|^2_{L^2(\mu)} - 2\mathbb{E}(f(H)) $$ almost surely as $m \to \infty$, by the law of large numbers. So minimising the empirical average of $\mathcal{L}(f, z_i)$ is, for large $m$, a proxy for minimising $\|f - h\|^2_{L^2(\mu)}$.
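The law-of-large-numbers step can also be illustrated by simulation. The sketch below again uses the hypothetical densities $f(x)=3x^2$ and $h(x)=2x$ on $[0,1]$ (not from the source): it draws samples $z_i$ from $h$ and checks that the empirical loss $\frac{1}{m}\sum_i \mathcal{L}(f, z_i)$ approaches $\|f\|^2_{L^2(\mu)} - 2\mathbb{E}(f(H))$.

```python
import math
import random

# Hypothetical model density f and data density h on [0, 1].
f = lambda x: 3 * x**2
norm_f_sq = 9 / 5                 # ||f||_{L^2}^2 in closed form
target = norm_f_sq - 2 * 1.5      # ||f||^2 - 2<f, h> = -1.2

# Sample z_i ~ h(x) = 2x via inverse-CDF sampling: H = sqrt(U).
random.seed(1)
m = 500_000
zs = [math.sqrt(random.random()) for _ in range(m)]

# Empirical average of L(f, z_i) = ||f||^2 - 2 f(z_i).
empirical_loss = sum(norm_f_sq - 2 * f(z) for z in zs) / m

print(empirical_loss, target)  # ≈ -1.2 for both
```

Note that the empirical loss differs from $\|f - h\|^2_{L^2(\mu)}$ only by the constant $\|h\|^2_{L^2(\mu)}$, so both have the same minimiser.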