Suppose we are given an i.i.d. sample of input-output pairs $(X_i, Y_i)$, $i = 1, \ldots, n$, where $Y_i = f(X_i) + \varepsilon_i$, $f: [0,1] \to [0,1]$ is the data-generating function, and the $\varepsilon_i$ are independent noise terms.
The task is to give an upper bound for the squared $L_2$-norm of the data-generating function, namely:
$$\|f\|_2^2 = \int_0^1 |f(x)|^2 \,dx.$$
I have done some research, but I couldn't find any results that provide a suitable upper bound (let us denote it by $\kappa$) for this norm when one is given noisy input-output data pairs.
For the noise-free case (namely, when $\varepsilon_k = 0$ for every $k = 1, \ldots, n$), a suitable estimator is the following:
$$\kappa := \frac{1}{n} \sum_{i=1}^n Y_i^2 \Rightarrow \mathbb{E}(\kappa) = \mathbb{E}\Big(\frac{1}{n}\sum_{i=1}^n Y_i^2\Big) = \mathbb{E}\Big(\frac{1}{n}\sum_{i=1}^n f(X_i)^2\Big) = \|f\|_2^2,$$
assuming the $X_i$ are uniformly distributed on $[0,1]$, so that $\mathbb{E}\big(f(X_i)^2\big) = \int_0^1 f(x)^2\,dx$.
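As a quick sanity check, here is a small Monte Carlo sketch of the noise-free estimator. The choice $f(x) = x^2$ and the uniform design are illustrative assumptions, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical data-generating function mapping [0, 1] to [0, 1]
    return x ** 2

n = 100_000
X = rng.uniform(0.0, 1.0, size=n)  # X_i assumed uniform on [0, 1]
Y = f(X)                           # noise-free observations

kappa = np.mean(Y ** 2)            # empirical estimator of ||f||_2^2
true_norm_sq = 0.2                 # closed form: int_0^1 x^4 dx = 1/5
print(kappa, true_norm_sq)
```

With $n = 10^5$ samples, `kappa` lands within a fraction of a percent of the exact value $1/5$.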
However, once the noise terms are not all $0$, this term becomes:
$$\frac{1}{n} \sum_{i=1}^n Y_i^2 = \frac{1}{n} \sum_{i=1}^n (f(X_i)+\varepsilon_i)^2 = \frac{1}{n} \sum_{i=1}^n \big(f(X_i)^2 + \varepsilon_i^2 + 2 f(X_i) \varepsilon_i\big).$$
If we introduce the following notation for the difference between the noisy and the noise-free case,
$$D := \frac{1}{n} \sum_{i=1}^n \big(\varepsilon_i^2 + 2 f(X_i) \varepsilon_i\big),$$
then, for zero-mean noise, $\mathbb{E}(D) = \mathbb{E}(\varepsilon_i^2) > 0$, so the estimator is biased upward and the noise-free argument no longer applies in the noisy case.
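The upward bias is easy to see numerically. Continuing the illustrative setup $f(x) = x^2$ with uniform $X_i$ and an assumed Gaussian noise level $\sigma = 0.3$:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return x ** 2  # hypothetical data-generating function

n = 200_000
sigma = 0.3                           # assumed noise standard deviation
X = rng.uniform(0.0, 1.0, size=n)
eps = rng.normal(0.0, sigma, size=n)  # zero-mean Gaussian noise
Y = f(X) + eps

kappa_noisy = np.mean(Y ** 2)
# E[kappa_noisy] = ||f||_2^2 + Var(eps) = 1/5 + 0.09 = 0.29 > 1/5
print(kappa_noisy)
```

The sample average of $Y_i^2$ concentrates around $\|f\|_2^2 + \mathbb{V}(\varepsilon) = 0.29$ rather than the target $0.2$.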
I would gladly welcome any statistical methods or ideas that provide an upper estimate of this function norm in the noisy case.
Any help is greatly appreciated!
(1) You ARE giving an upper bound on the $L^2$ norm of $f$. More precisely, your estimate converges to (and, for any $n$, is equal in expectation to) $\|f\|^2_2+\mathbb{V}(\epsilon)$, where $\mathbb{V}(\epsilon)$ is the variance of the noise. Clearly this is larger than $\|f\|^2_2$.
(2) To get a better upper bound, i.e. an estimator that converges to a number closer to $\|f\|^2_2$, you need an estimate of $\mathbb{V}(\epsilon)$ (which you can then subtract from your first estimator). If you don't have this a priori, the only way you can get it from your samples is to assume more about $f$ than mere square integrability. In fact, if all you know is that your function is square integrable, you can't distinguish $(f, \epsilon)$ from $(g,0)$ using your samples, where $g(x):=f(x)+\tilde{\epsilon}(x)$ and $\tilde\epsilon$ is any function with the same distribution as $\epsilon$ and orthogonal to $f$. In other words, you can't tell by looking at your samples whether the sample variance is the true $L^2$ norm of your function, or whether there is noise inflating the sample variance and the true $L^2$ norm of your function is smaller than the sample variance.
(3) If you have a bound of the form $\|\nabla f\|_{\infty}<C$ with some $C$ known to you (i.e., you do know more about your function than simply that it is square integrable), then here is one concrete way to get an estimate of $\mathbb{V}(\epsilon)$: form pairs of samples for which $\|X_{i_{k1}} - X_{i_{k2}}\|<\delta$ and define $Z_k:=(Y_{i_{k1}} - Y_{i_{k2}})^2/2$. The average $\overline{Z}:=\frac{1}{K}\sum_k Z_k$ has expectation $\mathbb{V}(\epsilon)+M$ where $M<C^2\delta^2/2$. By making $\delta$ smaller you bring this expectation closer to $\mathbb{V}(\epsilon)$, but you reduce your number $K$ of sample pairs, i.e. you increase the variance of the estimator of $\mathbb{V}(\epsilon)$.
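The pairing scheme above can be sketched as follows. This is a minimal one-dimensional illustration, again assuming $f(x) = x^2$ (so $C = 2$ works), uniform $X_i$, and Gaussian noise with $\sigma = 0.3$; pairing consecutive order statistics of $X$ is one convenient way to make the pair distances $\delta$ small:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return x ** 2  # hypothetical function with |f'(x)| <= C = 2 on [0, 1]

n = 200_000
sigma = 0.3                           # assumed noise standard deviation
X = rng.uniform(0.0, 1.0, size=n)
Y = f(X) + rng.normal(0.0, sigma, size=n)

# Pair consecutive order statistics of X: their gaps (the delta above)
# are O(1/n), so the bias term M < C^2 * delta^2 / 2 is negligible here.
order = np.argsort(X)
Y_s = Y[order]
Z = (Y_s[0::2] - Y_s[1::2]) ** 2 / 2.0  # Z_k = (Y_{i_k1} - Y_{i_k2})^2 / 2

var_hat = np.mean(Z)                    # estimates Var(eps) + M
bound = np.mean(Y ** 2) - var_hat       # de-biased estimate of ||f||_2^2
print(var_hat, bound)
```

Here `var_hat` recovers $\mathbb{V}(\epsilon) = 0.09$ closely, and subtracting it from the naive estimator brings the bound back near the true $\|f\|_2^2 = 0.2$.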