What are some of the common techniques for density estimation?


I'm trying to estimate the probability density function of a real random variable given its iid realizations. What are some of the standard techniques to do this? Also, are there any reasonable assumptions on the smoothness of the probability density function that are commonly used in this field?

I'd greatly appreciate any of your pointers or references.


There are 2 best solutions below

Best answer

Traditionally, histograms have been used as density estimators. Various algorithms for the number of bins, based on the number of observations, are in use. Rationales for these algorithms are based on the goal that the histogram should give an idea of the shape of the population distribution. Section 3.2 of the Wikipedia article on histograms shows several commonly implemented rules.
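As a rough illustration of how such rules differ, the sketch below compares three bin-count rules that ship with base R (Sturges, Scott, and Freedman-Diaconis). The simulated Gamma sample is an assumption made only for the example, and other software may implement different rules.

```r
# Compare three common bin-count rules on a simulated sample.
# The Gamma(shape = 5, rate = 0.1) sample is only an illustration.
set.seed(1776)
x <- rgamma(100, shape = 5, rate = 0.1)

bins <- c(Sturges = nclass.Sturges(x),  # ceiling(log2(n) + 1)
          Scott   = nclass.scott(x),    # bin width 3.5 * sd * n^(-1/3)
          FD      = nclass.FD(x))       # bin width 2 * IQR * n^(-1/3)
print(bins)

# hist() treats the rule as a suggestion and rounds to "pretty" break points,
# so the number of bars drawn may differ slightly from the counts above.
h <- hist(x, breaks = "FD", prob = TRUE, col = "skyblue2",
          main = "Histogram with Freedman-Diaconis bins")
```

Sturges' rule depends only on $n$, while the other two also use the spread of the data, so they can disagree noticeably for skewed or heavy-tailed samples.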

However, in binning the data to make a histogram, some information is lost. Accordingly, an empirical CDF curve (jumping by $\frac{1}{n}$ at each observation) is often a clearer indication of the population CDF than a histogram is of the population PDF (density). The plots below illustrate this for a sample of size $n = 100$ from $\mathsf{Gamma}(shape=5,rate=1/10).$

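[Figure: left panel, histogram of the sample with the Gamma density curve; right panel, ECDF of the sample with the Gamma CDF curve]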

set.seed(1776);  x = rgamma(100, 5, .1)   # sample of 100 from Gamma(shape=5, rate=0.1)
par(mfrow=c(1,2))                         # two panels side by side
  hist(x, prob=TRUE, col="skyblue2", main="Histogram: Sample of 100 from Gamma(5, rate=.1)")
    curve(dgamma(x, 5, .1), col="blue", lwd=3, add=TRUE)   # population density
  plot(ecdf(x), main="ECDF: Sample of 100 from Gamma(5, rate=.1)")
    curve(pgamma(x, 5, .1), lwd=3, col="blue", add=TRUE)   # population CDF
par(mfrow=c(1,1))
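
To make the "jump of $\frac{1}{n}$ at each observation" concrete, here is a minimal sketch that computes the ECDF by hand and checks it against R's `ecdf()`. The helper name `ecdf_manual` is made up for this example. It also computes the largest vertical gap between the ECDF and the true Gamma CDF, which is the one-sample Kolmogorov-Smirnov distance.

```r
set.seed(1776)
x <- rgamma(100, shape = 5, rate = 0.1)
n <- length(x)

# ECDF by hand: the proportion of observations that are <= t.
# (ecdf_manual is a hypothetical helper, not part of base R.)
ecdf_manual <- function(t) sapply(t, function(ti) mean(x <= ti))

Fn   <- ecdf(x)                              # built-in step function
grid <- seq(0, max(x), length.out = 200)
same <- all.equal(ecdf_manual(grid), Fn(grid))

# Largest gap between the ECDF and the true CDF. Because the ECDF jumps at
# each sorted observation, check just before and just after every jump.
xs  <- sort(x)
F0  <- pgamma(xs, shape = 5, rate = 0.1)
D   <- max(pmax(abs((1:n) / n - F0), abs((0:(n - 1)) / n - F0)))
print(D)
```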

One modern method of density estimation, usually implemented by computer, is the kernel density estimator (KDE). Roughly, a small smooth curve is centered at each observation, and these curves are averaged to form a smooth estimate of the population density. The term 'kernel' refers to the type of curve (perhaps a normal density curve), and the term 'bandwidth' refers to its width, which controls how smooth the estimate is. You can see the relevant Wikipedia article for a more technical explanation; I have found the book by Silverman (1986), referenced there, to be a very clearly written starting place.
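
For readers who want to see the formula behind a KDE, the following is a minimal sketch of a Gaussian kernel estimator written out by hand, $\hat f(t) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{t - x_i}{h}\right)$. The helper name `kde_manual`, the simulated normal sample, and the choice of `bw.nrd0()` for the bandwidth $h$ are assumptions made for illustration; R's own `density()` uses a faster binned approximation, so the two curves agree closely but not exactly.

```r
set.seed(1234)
x <- rnorm(500, mean = 100, sd = 15)
h <- bw.nrd0(x)   # rule-of-thumb bandwidth, the default used by density()

# f_hat(t) = (1 / (n h)) * sum_i K((t - x_i) / h), K = standard normal density.
# (kde_manual is a hypothetical helper written for this example.)
kde_manual <- function(t, data, h) {
  sapply(t, function(ti) mean(dnorm((ti - data) / h)) / h)
}

grid <- seq(min(x) - 4 * h, max(x) + 4 * h, length.out = 400)
fhat <- kde_manual(grid, x, h)

# Riemann-sum check that the estimate integrates to about 1.
area <- sum(fhat) * (grid[2] - grid[1])

plot(grid, fhat, type = "l", lwd = 2, xlab = "x", ylab = "density",
     main = "Gaussian KDE computed by hand")
lines(density(x, bw = h), col = "red", lty = 2)   # built-in estimate, same h
```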

KDEs are implemented in R statistical software. The figure below shows the density curve (cyan) of $\mathsf{Norm}(\mu=100, \sigma=15),$ a histogram of a random sample of size $n=500$ from this distribution, and the default KDE (red) from R based on the sample. (The tick marks below the horizontal axis show exact locations of the 500 observations.)

[Figure: histogram of the sample with the normal density curve (cyan) and the default KDE (red)]

set.seed(1234); m = 500;  mu = 100; sg = 15
x = rnorm(m, mu, sg)
hist(x, prob=TRUE, col="skyblue3", main="Sample of 500 from NORM(100,15)")
rug(x)  # tick marks at the exact locations of the observations
curve(dnorm(x, mu, sg), lwd = 2, col="cyan", add=TRUE)  # population density
lines(density(x), lwd=3, col="red")  # density estimator implemented here
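
The default KDE depends on an automatically chosen bandwidth. As a hedged sketch of how much that choice matters, the code below overlays estimates with the default bandwidth, a quarter of it, and four times it; the factors 1/4 and 4 are arbitrary choices for illustration. It also computes the Sheather-Jones bandwidth via `bw.SJ()`, an alternative selector available in base R.

```r
set.seed(1234)
x <- rnorm(500, mean = 100, sd = 15)

bw_default <- bw.nrd0(x)   # rule of thumb: 0.9 * min(sd, IQR/1.34) * n^(-1/5)
bw_sj      <- bw.SJ(x)     # Sheather-Jones plug-in selector
print(c(default = bw_default, SJ = bw_sj))

hist(x, prob = TRUE, col = "skyblue3", main = "Effect of bandwidth on the KDE")
lines(density(x, bw = bw_default / 4), col = "orange",    lwd = 2)  # undersmoothed
lines(density(x),                      col = "red",       lwd = 3)  # default
lines(density(x, bw = bw_default * 4), col = "darkgreen", lwd = 2)  # oversmoothed
```

A bandwidth that is too small produces a wiggly curve that follows sampling noise; one that is too large flattens real features of the density.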

Note: Of course, if you know the population is normal, then it is better to use this information and to estimate $\mu$ by $\bar X = 100.03$ and $\sigma$ by $S = 15.5.$ Then plot a normal density curve using these estimates.

mean(x);  sd(x)
## 100.0276
## 15.52221
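
A minimal sketch of that parametric alternative, assuming the same simulated sample: estimate the two parameters and draw the fitted normal curve. The names `xbar` and `s` are chosen here so that they do not clash with the plotting variable `x` that `curve()` uses internally.

```r
set.seed(1234)
x <- rnorm(500, mean = 100, sd = 15)

xbar <- mean(x)   # estimate of mu
s    <- sd(x)     # estimate of sigma

hist(x, prob = TRUE, col = "skyblue3", main = "Fitted normal density")
# Inside curve(), 'x' is the plotting grid, not the data vector above.
curve(dnorm(x, xbar, s), lwd = 2, col = "darkgreen", add = TRUE)
```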

Second answer

The most common methods are histograms and kernel density estimation. Good references are "All of Nonparametric Statistics" by Wasserman and "Density Estimation for Statistics and Data Analysis" by Silverman.

Yes, various smoothness assumptions play a role when you want to prove convergence of the estimate to the true density; see the suggested references for more info.
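
As one standard example of such an assumption (a sketch of the textbook result, not a full statement of its conditions): suppose the true density $f$ has a square-integrable second derivative and the kernel $K$ is a symmetric density with finite variance $\sigma_K^2 = \int u^2 K(u)\,du$. Writing $R(g) = \int g(u)^2\,du$, the asymptotic mean integrated squared error of a KDE with bandwidth $h$ is

$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\,h^4\,\sigma_K^4\,R(f'').$$

Minimizing over $h$ gives

$$h^* = \left(\frac{R(K)}{\sigma_K^4\,R(f'')\,n}\right)^{1/5}, \qquad \mathrm{AMISE}(h^*) \propto n^{-4/5}.$$

The first term is the variance and the second is the squared bias, so the smoothness of $f$ (through $R(f'')$) directly controls the best achievable rate under these assumptions.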