I have recently started studying kernel density estimation. For reference: if $(x_1,x_2,\dots,x_n)$ is an i.i.d. sample drawn from an (unknown) distribution with unknown density $f$, then the kernel density estimator of $f$ is the following: $$\hat{f}_h(x) := \frac{1}{nh}\sum_{i=1}^n K \Big(\frac{x-x_i}{h} \Big),$$ where $K$ is the kernel and $h > 0$ is the bandwidth.
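To make the definition concrete, here is a minimal sketch of the estimator in numpy (the Gaussian kernel $K(u) = e^{-u^2/2}/\sqrt{2\pi}$ and the bandwidth $h = 0.3$ are my own arbitrary choices, not part of the question):

```python
import numpy as np

def kde(x, sample, h):
    """Evaluate the kernel density estimator at the points x,
    using the Gaussian kernel K(u) = exp(-u**2 / 2) / sqrt(2 * pi)."""
    u = (x[:, None] - sample[None, :]) / h            # matrix of (x - x_i) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)      # kernel evaluations
    return K.sum(axis=1) / (len(sample) * h)          # (1 / (n h)) * sum_i K((x - x_i) / h)

rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)                    # n = 1000 draws from f = N(0, 1)
grid = np.linspace(-3, 3, 61)
fhat = kde(grid, sample, h=0.3)                       # fhat[30] estimates f(0) = 0.3989...
```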
I started looking for literature that states something about the "effectiveness" of this estimator, such as 1 and 2.
What I seem unable to find is a theoretical guarantee on the norm of the difference between the original and the estimated density, namely a bound on $||\hat{f}_h - f||_p$ for some $p$.
Is there a theorem that states something like the following: for every $\varepsilon > 0$ and every $\delta > 0$, there exists an $n_0$ such that for all $n \ge n_0$, $$\mathbb{P}(||\hat{f}_h - f||_p > \varepsilon) < \delta?$$
Any help is greatly appreciated!
Yes, there are results bounding $\lVert\hat{f}_h - f\rVert_p$ for any $p$. The results depend on the properties of the kernel $K$, the smoothness $s$ of the density $f$, the number of dimensions $d$, and the decay rate of the bandwidth $h$.
Roughly speaking, for an $s$-Hölder density $f$ on $\mathbb R^d$, if your kernel $K$ satisfies $\int x^\alpha K(x) \, dx = 0$ for all multi-indices $\alpha$ with $\lvert \alpha \rvert \in [1, s-1]$, and your bandwidth $h$ decays at the asymptotic rate: $$ h_n \sim\begin{cases} n^{-1/(2s+d)}, &\quad p < \infty,\\ (n/\log(n))^{-1/(2s+d)}, &\quad p = \infty, \end{cases}$$ you can expect the optimal rates of convergence: $$ \mathbb E [\lVert \hat f_h - f \rVert_p] = \begin{cases} O(n^{-s/(2s+d)}), &\quad p < \infty,\\ O((n/\log(n))^{-s/(2s+d)}), &\quad p = \infty.\end{cases} $$
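You can see the $p = 2$ rate in a small simulation. This sketch (my own setup, not from a reference) takes $f$ to be the standard normal density with $s = 2$ and $d = 1$, so the prescription above gives $h_n = n^{-1/5}$ and an expected error of order $n^{-2/5}$:

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(-4, 4, 201)
dx = grid[1] - grid[0]
true = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)    # f = N(0, 1) density

def l2_error(n):
    """L2 distance between the Gaussian-kernel KDE with the
    rate-optimal bandwidth h = n**(-1/5) (s = 2, d = 1) and f."""
    sample = rng.standard_normal(n)
    h = n ** (-1 / 5)
    u = (grid[:, None] - sample[None, :]) / h
    fhat = (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum(axis=1) / (n * h)
    return np.sqrt(np.sum((fhat - true) ** 2) * dx)  # Riemann-sum L2 norm

# average error over 20 replications, for increasing sample sizes
errs = [np.mean([l2_error(n) for _ in range(20)]) for n in (100, 1000, 10000)]
```

With the $n^{-2/5}$ rate, each tenfold increase in $n$ should shrink the error by a factor of roughly $10^{2/5} \approx 2.5$.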
You can get an introduction to these results in section 1.2 of Tsybakov's Introduction to Nonparametric Estimation; for more details, you might want to check out the references in this paper by Goldenshluger and Lepski.
Edit: it's worth noting that if you want finite-sample error bounds, or confidence sets for $f$, these aren't possible without prior knowledge of $f$. As an example, fix a kernel $K$ and bandwidth $h$, and consider the error in estimating the densities $f_k(x) = 1 + \cos(k \pi x)$ on $[0, 1]$, for integers $k \to \infty$.
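A quick way to quantify this example (my own sketch, assuming a Gaussian kernel and ignoring boundary effects on $[0,1]$): convolving $f_k$ with a Gaussian kernel of width $h$ damps the $\cos(k\pi x)$ term by the factor $e^{-(k\pi h)^2/2}$, so the bias $\lVert \mathbb E[\hat f_h] - f_k\rVert_\infty$ approaches $1$ as $k \to \infty$, no matter how much data you collect:

```python
import numpy as np

h = 0.1                               # bandwidth fixed in advance
sup_biases = []
for k in (1, 5, 20):
    # Fourier argument: a Gaussian kernel of width h multiplies the
    # frequency-(k*pi) component of f_k by exp(-(k*pi*h)**2 / 2).
    damping = np.exp(-(k * np.pi * h) ** 2 / 2)
    sup_biases.append(1 - damping)    # sup_x |E[fhat](x) - f_k(x)|, away from the boundary
```

So for $k = 20$ the systematic error is already essentially $1$, which is why no single fixed choice of $(K, h)$ can come with an honest finite-sample guarantee over all of these densities.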