Density Estimation and Analysis


This is an excerpt from BW Silverman's 'Density Estimation for Statistics and Data Analysis.'

The oldest and most widely used density estimator is the histogram. Given an origin $x_0$ and a bin width of $h$, we define the bins of the histogram to be the intervals $[x_0 + mh, x_0 + (m+1)h)$ for integers $m$. The histogram is defined by

$$ \hat{f}(x) = \dfrac{1}{nh}(\text{no. of $X_i$ in the same bin as $x$})$$


Please help me understand where this formula comes from.

Since I am seeing $\hat{f}$, does this mean that we are talking about an estimator? I also did not understand the formula. In particular, I am not sure about the switch from $m$ to $n$; I just assumed that they are the same thing.

Any insights would be appreciated.

Best answer

Yes, $\hat f$ is an estimator. But what it estimates is not a scalar quantity, or even a vector-valued parameter. Rather, it is an estimator of a function. The particular function being estimated is the true underlying probability density from which the sample was presumed to have been drawn. As such, it is what we would characterize as a nonparametric estimator: the distribution need not be a member of any particular parametric family.

The meaning of $\hat f$ is that you choose an "origin" $x_0$, and then partition the entire real line at points $$\{\ldots, x_0 - 2h, x_0 - h, x_0, x_0 + h, x_0 + 2h, \ldots\}.$$ Then you take your sample $(X_1, X_2, \ldots, X_n)$ containing a total of $n$ observations, and for each interval (called a "bin") in your partition, you count the number of observations that fall into that interval/bin. Note that $m$ and $n$ are not the same thing: $m$ is just an integer index that labels the bins, whereas $n$ is the sample size. Of course, all but finitely many of these bins will not have any observations. But for those that do, you keep a tally. Then you divide the tallies by the product of the total number of observations $n$ and the width of the bin $h$. This gives you the height of the histogram in each bin.
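The recipe above can be sketched in a few lines of Python. This is only a minimal illustration, not anything taken from Silverman's book: the function name `histogram_density` and its argument names are my own, and it assumes half-open bins $[x_0 + mh, x_0 + (m+1)h)$ so that every point belongs to exactly one bin.

```python
import math


def histogram_density(x, sample, x0, h):
    """Evaluate the histogram estimate at x, with origin x0 and bin width h.

    Hypothetical helper for illustration; bins are taken to be the
    half-open intervals [x0 + m*h, x0 + (m+1)*h) for integers m.
    """
    n = len(sample)
    # Index m of the bin that contains the evaluation point x.
    m = math.floor((x - x0) / h)
    # Tally the observations that land in the same bin as x.
    count = sum(1 for xi in sample if math.floor((xi - x0) / h) == m)
    # Divide the tally by (sample size) * (bin width).
    return count / (n * h)
```

For instance, `histogram_density(0.5, [0.2, 0.7, 1.4], x0=0.0, h=1.0)` gives $2/3$, because two of the three observations share the bin $[0, 1)$ with the point $0.5$, and $nh = 3$.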

Here is a concrete example. I will choose $x_0 = 0.5$, $h = 2$, and my sample is $$\{3, 7, 2, 1, 0, 5, 10, 3, 2, 4\}.$$ Then $n = 10$. My partition looks like this: $$\{\ldots, -1.5, 0.5, 2.5, 4.5, 6.5, 8.5, 10.5, \ldots \},$$ where I have kept only those endpoints that "cover" my data, because my smallest observation is $0$ and my largest is $10$.

Now I count: In $[-1.5, 0.5)$, there is one observation in my sample (namely $0$). In $[0.5, 2.5)$, there are $3$ observations. And so forth. (It helps to sort the observations first.) The result is $(1, 3, 3, 1, 1, 1)$. Dividing each count by $nh = 10 \cdot 2 = 20$, and merging adjacent bins that have the same height, my histogram/density estimator is $$\hat f(x) = \begin{cases} 0 & x < -1.5 \\ \frac{1}{20} & -1.5 \le x < 0.5 \\ \frac{3}{20} & 0.5 \le x < 4.5 \\ \frac{1}{20} & 4.5 \le x < 10.5 \\ 0 & 10.5 \le x. \end{cases}$$ This is a nonnegative step function, and it has the property that it integrates to $1$, thus it is a true density function.
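If you want to check the arithmetic in this example, here is a small sketch that recomputes the bin counts and the total area with exact fractions. The variable names are my own choices, and the bins are again assumed to be half-open on the right.

```python
import math
from collections import Counter
from fractions import Fraction

sample = [3, 7, 2, 1, 0, 5, 10, 3, 2, 4]
x0, h = Fraction(1, 2), 2
n = len(sample)

# Bin index m for each observation: bin m is [x0 + m*h, x0 + (m+1)*h).
counts = Counter(math.floor((xi - x0) / h) for xi in sample)

# Height of the estimate on each occupied bin, kept as an exact fraction.
heights = {m: Fraction(c, n * h) for m, c in sorted(counts.items())}

# Area under the step function: sum of (height * bin width) over the bins.
area = sum(height * h for height in heights.values())

for m, height in heights.items():
    left = x0 + m * h
    print(f"[{float(left)}, {float(left + h)}): count={counts[m]}, height={height}")
print("area =", area)
```

The printed counts should match $(1, 3, 3, 1, 1, 1)$, and the area should come out to exactly $1$, since the counts sum to $n$ and each is multiplied by $h/(nh) = 1/n$.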

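In practice, the more observations $n$ you have in your sample, the narrower you can afford to make your bin width $h$, and the closer the density estimator $\hat f$ will come to the "true" underlying density. The two have to be balanced: $h$ should shrink as $n$ grows, but slowly enough that $nh \to \infty$, so that each bin still collects many observations. If $h$ is too small for the sample size, most bins are empty and the estimate is very noisy.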
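To see this behaviour numerically, here is a rough simulation sketch. Everything in it is an assumption made for illustration: the data are drawn from a standard normal distribution, the estimate is evaluated only at $x = 0$, and the bin width is set to $h = n^{-1/3}$, which is one common way to let $h$ shrink while $nh$ still grows.

```python
import math
import random


def histogram_density(x, sample, x0, h):
    """Histogram estimate at x with half-open bins [x0 + m*h, x0 + (m+1)*h)."""
    n = len(sample)
    m = math.floor((x - x0) / h)
    count = sum(1 for xi in sample if math.floor((xi - x0) / h) == m)
    return count / (n * h)


def true_density(x):
    """Standard normal density, the assumed 'true' density in this sketch."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


random.seed(0)  # fixed seed so that repeated runs give the same numbers
errors = {}
for n in (100, 10_000, 100_000):
    h = n ** (-1 / 3)  # illustrative choice: h -> 0 while n*h -> infinity
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    estimate = histogram_density(0.0, sample, 0.0, h)
    errors[n] = abs(estimate - true_density(0.0))
    print(f"n={n}, h={h:.4f}, estimate={estimate:.4f}, error={errors[n]:.4f}")
```

The error at a single point is random, so it need not decrease at every step, but for large $n$ it should typically be small compared with the true value $1/\sqrt{2\pi} \approx 0.399$.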