Statistical method of giving an upper estimate for $\|f\|_2^2$ from data


Suppose that we are given an i.i.d. sample of input-output pairs $(X_i,Y_i)$, $i=1,\ldots,n$, where $Y_i=f(X_i)+\varepsilon_i$, $f: [0,1] \to [0,1]$ is the data-generating function, and the $\varepsilon_i$ are independent noise terms.

The task is to give an upper estimate for the squared $L_2$-norm of the data-generating function, namely:

$$\|f\|_2^2 = \int_0^1 |f(x)|^2 \,dx.$$

I have done some research, but I could not find any result that provides a suitable upper bound (let us denote it by $\kappa$) for this norm when one is given noisy input-output pairs.

For the noise-free case (namely, when $\varepsilon_k = 0$ for every $k = 1,\ldots,n$) and with $X_i \sim \mathrm{Uniform}[0,1]$, a suitable estimator is the following:

$$\kappa := \frac{1}{n} \sum_{i=1}^n Y_i^2 \;\Rightarrow\; \mathbb{E}(\kappa) = \mathbb{E}\Big(\frac{1}{n}\sum_{i=1}^n Y_i^2\Big) = \mathbb{E}\Big(\frac{1}{n}\sum_{i=1}^n f(X_i)^2\Big) = \|f\|_2^2.$$
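As a sanity check, this noise-free estimator can be simulated. The following is a hypothetical sketch, assuming $X_i \sim \mathrm{Uniform}[0,1]$ and taking $f(x) = x$ as an example, so that $\|f\|_2^2 = 1/3$:

```python
import random

# Sketch of the noise-free case: with X_i ~ Uniform[0, 1] (an assumption
# the expectation computation needs) and the example f(x) = x, we have
# ||f||_2^2 = 1/3, and kappa = (1/n) * sum(Y_i^2) is an unbiased
# Monte Carlo estimate of it.
random.seed(0)
n = 200_000
f = lambda x: x
ys = [f(random.random()) for _ in range(n)]   # epsilon_i = 0 for all i
kappa = sum(y ** 2 for y in ys) / n
print(kappa)  # close to 1/3
```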

However, once the noise terms are not all $0$, this term becomes:

$$\frac{1}{n} \sum_{i=1}^n Y_i^2 = \frac{1}{n} \sum_{i=1}^n (f(X_i)+\varepsilon_i)^2 = \frac{1}{n} \sum_{i=1}^n \big(f(X_i)^2 + \varepsilon_i^2 + 2 f(X_i) \varepsilon_i\big).$$

If we introduce the following notation for the difference between the noisy and the noise-free case,

$$D := \frac{1}{n} \sum_{i=1}^n \big(\varepsilon_i^2 + 2 f(X_i) \varepsilon_i\big),$$

then, assuming the noise has zero mean (so the cross terms vanish in expectation), we easily obtain $\mathbb{E}(D) = \frac{1}{n}\sum_{i=1}^n \mathbb{E}(\varepsilon_i^2) > 0$; therefore this estimator is biased upward in the noisy case.

I would welcome any statistical methods or ideas that provide an upper estimate of this function norm in the noisy case.

Any help is greatly appreciated!

Best answer:

(1) You ARE giving an upper bound on the $L^2$ norm of $f$. More precisely, your estimate converges to (and for any $n$ is equal to in expectation) $\|f\|^2_2+\mathbb{V}(\epsilon)$, where $\mathbb{V}(\epsilon)$ is the variance of the noise. Clearly this is larger than $\|f\|^2_2$.
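A quick simulation illustrates this inflation. This is a hypothetical sketch with the example $f(x) = x$ (so $\|f\|_2^2 = 1/3$), uniform inputs, and Gaussian noise with $\sigma = 0.5$:

```python
import random

# Illustration of point (1): with mean-zero noise of variance sigma^2,
# the sample second moment converges to ||f||_2^2 + V(eps), not to
# ||f||_2^2 -- so it is a valid (if crude) upper estimate.
random.seed(1)
n, sigma = 200_000, 0.5
f = lambda x: x                                # ||f||_2^2 = 1/3
ys = [f(random.random()) + random.gauss(0.0, sigma) for _ in range(n)]
kappa = sum(y ** 2 for y in ys) / n
print(kappa)  # close to 1/3 + 0.25, above ||f||_2^2 = 1/3
```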

(2) To get a better upper bound, i.e. to get an estimator that converges to a number that is closer to $\|f\|^2_2$ you need an estimate of $\mathbb{V}(\epsilon)$ (which you can then subtract from your first estimator). If you don't have this a priori, the only way you can get it from your samples is to assume more about $f$ than mere square integrability. In fact, if all you know is that your function is square integrable, you can't distinguish $(f, \epsilon)$ from $(g,0)$ using your samples. Here $g(x):=f(x)+\tilde{\epsilon}(x)$ where $\tilde\epsilon$ is any function with the same distribution as $\epsilon$ and orthogonal to $f$. In other words, you can't tell by looking at your samples whether the sample variance is the true $L^2$ norm of your function or whether there is noise inflating the sample variance and the true $L^2$ norm of your function is smaller than the sample variance.

(3) If you have a bound of the form $\|\nabla f\|_{\infty}<C$ with some $C$ known to you (i.e., you do know more about your function than simply that it is square integrable), then here is one concrete way to get an estimate of $\mathbb{V}(\epsilon)$: form pairs of samples for which $\|X_{i_{k1}} - X_{i_{k2}}\|<\delta$ and define $Z_k:=(Y_{i_{k1}} - Y_{i_{k2}})^2/2$. The average $\overline{Z}:=\frac{1}{K}\sum_k Z_k$ has expectation $\mathbb{V}(\epsilon)+M$, where $0 \le M<C^2\delta^2/2$. By making $\delta$ smaller you push this expectation closer to $\mathbb{V}(\epsilon)$, but you reduce your number $K$ of sample pairs, i.e. you increase the variance of the estimator of $\mathbb{V}(\epsilon)$.
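A hypothetical sketch of this pairing scheme, assuming the example $f(x) = x^2$ (so $C = 2$ works on $[0,1]$) and Gaussian noise with $\sigma = 0.3$:

```python
import random

# Pair samples whose inputs are closer than delta and average
# Z_k = (Y_1 - Y_2)^2 / 2; its expectation is V(eps) + M with
# 0 <= M < C^2 * delta^2 / 2, so for small delta it estimates V(eps).
random.seed(2)
n, delta, sigma = 100_000, 0.005, 0.3
f = lambda x: x ** 2                 # |f'| <= 2 on [0, 1], i.e. C = 2
xs = [random.random() for _ in range(n)]
data = sorted((x, f(x) + random.gauss(0.0, sigma)) for x in xs)

zs, i = [], 0
while i + 1 < len(data):
    (x1, y1), (x2, y2) = data[i], data[i + 1]
    if x2 - x1 < delta:              # a valid pair: inputs delta-close
        zs.append((y1 - y2) ** 2 / 2)
        i += 2                       # use each sample in at most one pair
    else:
        i += 1
var_hat = sum(zs) / len(zs)
print(var_hat)  # close to sigma**2 = 0.09 (bias below C**2 * delta**2 / 2)
```

One can then subtract `var_hat` from the sample second moment of the $Y_i$, as point (2) suggests, to obtain a less inflated upper estimate of $\|f\|_2^2$.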

Another answer:

You need some a priori information on what the function $f$ could look like if you want to use this in a practical setting. For an extreme example, consider a function that is identically zero on the interval $[0,1]$ except on some tiny interval $[0.5-\varepsilon, 0.5+\varepsilon]$ for some small $\varepsilon$. Inside that tiny interval, put a smooth, huge spike so that $\int_0^1 f(x)\,dx = 1$. In the limit as $n$ goes to infinity you will recover the function, but for 'small' $n$ your estimate of $f$ will just be the constant zero function. For sufficiently small $\varepsilon$, this 'small' $n$ can be arbitrarily large.

A useful assumption for applications is that $f$ is continuously differentiable and that there exists some constant $C$ such that $|f'(x)|<C$ for all $x$, where $C$ is chosen depending on your application. With that extra information you can get good estimates of how well $f$ can be approximated as a function of $n$.

Edit: some more thoughts on the noise terms. You wrote that they are independent, but I think you additionally need to assume that they follow (or are bounded by) some known distribution with fixed parameters.

The easiest case is when you know a priori that the errors are bounded in absolute value by some constant $c$. Then you can simply substitute the bound $|\varepsilon_i| \le c$ into your equations and get an upper bound.
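A minimal sketch of this bounded-noise recipe, under the assumed example $f(x) = x$ and uniform noise on $[-c, c]$:

```python
import random

# With |eps_i| <= c known a priori, |f(X_i)| <= |Y_i| + c, so
# kappa_up = (1/n) * sum((|Y_i| + c)^2) over-estimates ||f||_2^2
# in expectation.  Here f(x) = x, so ||f||_2^2 = 1/3.
random.seed(3)
n, c = 100_000, 0.1
f = lambda x: x
ys = [f(random.random()) + random.uniform(-c, c) for _ in range(n)]
kappa_up = sum((abs(y) + c) ** 2 for y in ys) / n
print(kappa_up)  # comfortably above ||f||_2^2 = 1/3
```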

The more common case is to assume that the error terms are normally distributed with mean zero and some known variance $\sigma^2$. Unfortunately, this means that in principle an error could be arbitrarily large, although that would be astronomically unlikely. Hence you fix some confidence level, say $99\%$, and compute the corresponding bounds. You can then say that, with $99\%$ confidence, the errors lie within these absolute bounds, and plug the bounds into your equations just as in the case of absolute bounds.