Kernel density estimation - effect of bandwidth


I am trying to learn kernel density estimation, and I need help understanding how the bandwidth $h$ affects the kernel density estimator. Consider the Gaussian kernel $K(x) = \frac{1}{\sqrt{2 \pi}} e^{-x^2/2}$. The kernel density estimator is given by $\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i)$.

Clearly, $K(x)$ is independent of $h$, so where does $h$ come in? What would $\hat{f}_h(x)$ be? How does $h$ affect the kernel?

Thank you!


There are 3 answers below.

Answer 1

Density estimators are of the form:

$$ \hat{f}(x_0) = \frac{1}{nh} \sum_{i=1}^n K \left ( \frac{x_i - x_0}{h} \right ) $$

For any choice of kernel, the bandwidth $h$ is a smoothing parameter, and controls how smooth the fit is by controlling the size of the neighbourhood around the reference, $x_0$.

If $h$ is large, we consider a large neighbourhood, and vice versa.

In the Gaussian kernel case, varying $h$ has the same effect as varying the standard deviation of a Gaussian: small $h$ gives a thinner, more peaked kernel, whereas larger $h$ gives a fatter one, approaching a flat line in the extreme.
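A minimal sketch of this in Python (the sample and the evaluation points are hypothetical; the kernel and estimator are the standard ones from the formula above) shows how a small bandwidth produces sharp peaks at the data while a large one flattens the estimate:

```python
import math

def gaussian_kernel(u):
    # Standard Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2*pi)
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x0, data, h):
    # f_hat(x0) = 1/(n*h) * sum_i K((x_i - x0) / h)
    n = len(data)
    return sum(gaussian_kernel((xi - x0) / h) for xi in data) / (n * h)

# Hypothetical sample: two tight clusters
data = [1.0, 1.2, 1.4, 4.8, 5.0, 5.2]

for h in (0.2, 2.0):
    print(f"h={h}: f(1.2)={kde(1.2, data, h):.3f}, f(3.0)={kde(3.0, data, h):.3f}")
# Small h: a sharp peak at the data, near zero between the clusters.
# Large h: a much flatter estimate everywhere.
```

With $h = 0.2$ the estimate is large at a data point (1.2) and essentially zero midway between the clusters (3.0); with $h = 2.0$ the two values are close to each other, i.e. the estimate is nearly flat.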

Answer 2

Most simply and intuitively: For a given sample size and population structure, a small bandwidth will do less smoothing than a large one.

Each panel in the plot below shows the histogram of a sample of size $n = 100$ from $\mathsf{Norm}(\mu=100,\,\sigma=15).$ Beneath the histogram, tick marks show the exact locations of the observations. The density estimates (green curves) use the default Gaussian kernel; from left to right, the bandwidths are the default multiplied by 0.5, 1, and 2, respectively.

[Figure: three histograms of the $n = 100$ sample with green density curves, bandwidths 0.5x, 1x, and 2x the default]

A similar figure, but with $n = 500.$

[Figure: the same three panels for a sample of $n = 500$]

You can read the particulars in the R help page for 'density' and its references. I especially recommend the book by Bernard Silverman, Density Estimation for Statistics and Data Analysis.


R code for the second figure is given below, in case you want it. Each run gives a different sample. With sample sizes as small as 100 and 500, results vary considerably from one run to another. However, the general principle that a larger bandwidth gives a smoother density estimator is evident in almost all runs.

par(mfrow=c(1,3))            # three panels side by side
x = rnorm(500, 100, 15)      # sample of n = 500 from Norm(100, 15)
hist(x, prob=TRUE, col="skyblue2", main="Small Bandwidth"); rug(x)
  lines(density(x, adjust=.5), lwd=2, col="darkgreen")   # half the default bandwidth
hist(x, prob=TRUE, col="skyblue2", main="Default Bandwidth"); rug(x)
  lines(density(x), lwd=2, col="darkgreen")              # default bandwidth
hist(x, prob=TRUE, col="skyblue2", main="Large Bandwidth"); rug(x)
  lines(density(x, adjust=2), lwd=2, col="darkgreen")    # twice the default bandwidth
par(mfrow=c(1,1))            # restore single-panel layout
Answer 3

One data-driven choice: the optimal bandwidth should maximize the leave-one-out pseudo-likelihood $\mathcal{L}(h)=\prod_{j=1}^{n}\hat{f}_{-j}(x_j \mid h)$, where $\hat{f}_{-j}(x_j \mid h)$ is the density estimate at $x_j$ with the $i = j$ term omitted from the sum. You then solve $\frac{\partial \mathcal{L}}{\partial h}=0$ at a point where $\frac{\partial^2 \mathcal{L}}{\partial h^2}<0$ (N.B. it is often more tractable to maximize the log-likelihood).
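A minimal sketch of that criterion in Python, assuming a Gaussian kernel and evaluating the leave-one-out log pseudo-likelihood on a small grid of candidate bandwidths (the sample and the grid are hypothetical; a real implementation would optimize over $h$ rather than use a coarse grid):

```python
import math

def gaussian_kernel(u):
    # Standard Gaussian kernel: K(u) = exp(-u^2 / 2) / sqrt(2*pi)
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def loo_log_likelihood(data, h):
    # sum_j log f_{-j}(x_j | h), where the i = j term is left out of each sum
    n = len(data)
    total = 0.0
    for j, xj in enumerate(data):
        s = sum(gaussian_kernel((xi - xj) / h)
                for i, xi in enumerate(data) if i != j)
        total += math.log(s / ((n - 1) * h))
    return total

# Hypothetical sample: two tight clusters
data = [1.0, 1.2, 1.4, 4.8, 5.0, 5.2]
grid = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best_h = max(grid, key=lambda h: loo_log_likelihood(data, h))
print("best h on grid:", best_h)
```

Very small $h$ is penalized because each left-out point sits in a near-zero region of the estimate built from the others, and very large $h$ is penalized because the estimate is spread too thin; the criterion picks something in between.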