Error on the Bin of a Normalised Histogram

Suppose I have a histogram, $N$, with bins of width $\Delta x$ indexed by $i$. The count in a single bin is then $N_{i}$.

I wish to estimate the empirical density for a certain bin. Defining $N_{Total} = \sum^{dim(N)}_{i=0} N_{i}$ and $\hat{N}_{i}$ as the empirical density of a single bin, I can use:

$\hat{N_{i}} = \frac{N_{i}}{N_{Total}\Delta x}$

This is a standard method for measuring empirical density. For instance, see here.

Suppose I then wish to calculate the error. I use the chain rule (standard error propagation) and assume that the number of points in a bin follows a Poisson distribution, so that $dN_{i} = \sqrt{N_{i}}$ and $dN_{Total} = \sqrt{N_{Total}}$.

My solution for the error, using chain rule and solving, is:

$d \hat{N_{i}} = \hat{N_{i}}\sqrt{\frac{1}{N_{i}} + \frac{1}{N_{Total}}}$.
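For reference, here is a sketch of the propagation step behind this formula. It assumes that $\Delta x$ is an exact constant and treats $N_{i}$ and $N_{Total}$ as independent, which is only an approximation because $N_{i}$ is one of the terms in $N_{Total}$:

```latex
\begin{align*}
\hat{N_{i}} &= \frac{N_{i}}{N_{Total}\,\Delta x}, \\
\frac{\partial \hat{N_{i}}}{\partial N_{i}} &= \frac{1}{N_{Total}\,\Delta x} = \frac{\hat{N_{i}}}{N_{i}}, \qquad
\frac{\partial \hat{N_{i}}}{\partial N_{Total}} = -\frac{N_{i}}{N_{Total}^{2}\,\Delta x} = -\frac{\hat{N_{i}}}{N_{Total}}, \\
\left(d\hat{N_{i}}\right)^{2} &= \left(\frac{\hat{N_{i}}}{N_{i}}\right)^{2} \left(dN_{i}\right)^{2}
  + \left(\frac{\hat{N_{i}}}{N_{Total}}\right)^{2} \left(dN_{Total}\right)^{2}
  = \hat{N_{i}}^{2} \left(\frac{1}{N_{i}} + \frac{1}{N_{Total}}\right),
\end{align*}
```

where the last step substitutes $(dN_{i})^{2} = N_{i}$ and $(dN_{Total})^{2} = N_{Total}$. Taking the square root gives the expression above.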

This solution, however, did not include the error on $\Delta x$, which I am fairly sure has an influence on the value $\hat{N_{i}}$ and thus $d\hat{N_{i}}$.

Is my solution correct as is?

Should the error on $\Delta x$ be included in the calculation as well, if this is even applicable?

If so, should the error on the bin width be $d \Delta x = \frac{\Delta x}{2}$?

There are 1 best solutions below

Some comments on density estimation: As you pursue your efforts to approximate a population density by histograms, here is some background information you may find helpful.

Density histograms. In R, you can use the parameter prob=T with the hist procedure to get a histogram in which the total area of all bars in the histogram is $1.$ That makes it feasible to plot on the same axes the density curve of the continuous distribution from which the data were randomly sampled. For reasonably large samples there is usually a good match between the histogram and the density function.

Consider a random sample x of size $n = 500$ from the distribution $\mathsf{Gamma}(\mathrm{shape}=\alpha=6, \mathrm{rate} = \lambda = 0.1),$ which has mean $\mu = \alpha/\lambda = 60.$

set.seed(429)
x = rgamma(500, 6, .1)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.696  40.811  56.335  58.977  74.251 156.755 

hdr = "n = 500: GAMMA(6, .1)"
hist(x, prob=T, ylim=c(0,.02), col="skyblue2", label=T, main=hdr)
 curve(dgamma(x, 6, .1), add=T, col="brown", lwd=2) 

[Figure: density histogram of x, titled "n = 500: GAMMA(6, .1)", with the Gamma(6, .1) density curve overlaid]

The width of each bar is 20, and I have added labels to show the heights (densities) of the bars. [Also, see the Note at the end.]

Kernel density estimation (KDE). A kernel density estimator seeks to approximate the population density function by using a mixture of 'kernels' (shapes, which can be chosen to be rectangles, normal density functions, etc.). You may want to read the Wikipedia article on 'kernel density estimation' and perhaps some of its references, especially those by B. Silverman.

Here is the same histogram as above. The default KDE from R is shown as a dotted curve. Tick marks along the horizontal axis show positions of the 500 observations.

hdr2 = "n = 500: KDE (dotted) and GAMMA(6, .1) PDF"
hist(x, prob=T, ylim=c(0,.02), col="skyblue2", main=hdr2); rug(x)
 curve(dgamma(x, 6, .1), add=T, col="brown")
 lines(density(x), lty="dotted", lwd=2) 

[Figure: the same histogram with rug marks, titled "n = 500: KDE (dotted) and GAMMA(6, .1) PDF"]

The KDE is computed without regard to the histogram; if I had chosen different cutpoints for the histogram bars, the KDE would be the same. [Below, parameter br=25 was used to 'suggest' using more bars--perhaps too many.]

[Figure: the same plot redrawn with br=25, giving narrower bars]
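The code for this last figure is not shown above. A minimal sketch of how it might be produced, assuming the same seed and sample as before (the object names `kde`, `h_default`, `h_fine` and the plot title are illustrative, not from the original):

```r
# Recreate the sample used throughout this answer.
set.seed(429)
x <- rgamma(500, 6, 0.1)

# The KDE is computed from the data alone; no bin information enters.
kde <- density(x)

# 'breaks = 25' is only a suggestion: R chooses "pretty" cutpoints
# near that number, so the actual bin count may differ.
h_default <- hist(x, plot = FALSE)
h_fine    <- hist(x, breaks = 25, plot = FALSE)

# Draw the finer histogram on the density scale and overlay
# the unchanged KDE and the Gamma(6, 0.1) density.
plot(h_fine, freq = FALSE, col = "skyblue2",
     main = "n = 500: finer bins, same KDE")
rug(x)
curve(dgamma(x, 6, 0.1), add = TRUE, col = "brown")
lines(kde, lty = "dotted", lwd = 2)
```

Because `kde` is built before either histogram, changing the cutpoints changes only the bars, not the dotted curve.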

Empirical CDFs (ECDFs). The empirical CDF of a sample of size $n$ is made by sorting the sample from smallest to largest. Starting at $0$ on the left, the ECDF increases by $1/n$ at each observation, reaching $1$ at the right.

Often an ECDF gives a more accurate view of the CDF of the population than a histogram gives of the density function (because ECDFs do not rely on arbitrary binning). Below, the ECDF of x is compared with the CDF of $\mathsf{Gamma}(6,.1).$

plot(ecdf(x), col="skyblue2")
 curve(pgamma(x, 6, .1), add=T, col="brown", lwd=2)

[Figure: ECDF of x with the Gamma(6, .1) CDF overlaid]
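The construction described above (sort, then step up by $1/n$ at each observation) can be sketched by hand and checked against R's `ecdf`. The names `xs`, `Fn_manual` and `gap` are illustrative, and the final line is only a rough measure of agreement, evaluated at the top of each jump:

```r
# Recreate the sample used throughout this answer.
set.seed(429)
x <- rgamma(500, 6, 0.1)

n  <- length(x)
xs <- sort(x)                 # observations from smallest to largest

# Height of the ECDF immediately after each sorted observation:
# 1/n, 2/n, ..., 1 (assuming no tied values, as expected for continuous data).
Fn_manual <- seq_len(n) / n

# R's built-in ECDF evaluated at the same points should agree.
Fn <- ecdf(x)
all.equal(Fn(xs), Fn_manual)

# Largest vertical gap between the ECDF steps and the Gamma(6, 0.1) CDF,
# checked only at the observed points.
gap <- max(abs(Fn_manual - pgamma(xs, 6, 0.1)))
gap
```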

Note: In R, a non-plotted histogram is a list of details of the numbers used to make a histogram, some of which are copied below:

 hist(x, prob=T, plot=F)

$breaks
[1]   0  20  40  60  80 100 120 140 160

$counts
[1]  10 109 169 121  62  19   6   4

$density
[1] 0.0010 0.0109 0.0169 0.0121 0.0062 0.0019 0.0006 0.0004

$mids
[1]  10  30  50  70  90 110 130 150
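To connect this output with the formulas in the question, the sketch below recomputes the densities from the listed counts and breaks, checks that the bars have total area $1$, and then applies the question's error expression to each bar. The error line is shown only as an illustration of that formula under the question's own assumptions (Poisson counts, exact bin width); the names `width`, `dens` and `err` are illustrative.

```r
# Values copied from the hist() output listed above.
breaks <- c(0, 20, 40, 60, 80, 100, 120, 140, 160)
counts <- c(10, 109, 169, 121, 62, 19, 6, 4)

n     <- sum(counts)     # total number of observations (500)
width <- diff(breaks)    # bin widths (all equal to 20 here)

# Density of each bar: count / (total count * bin width).
dens <- counts / (n * width)
dens

# Total area of the bars of a density histogram.
sum(dens * width)

# Error from the question's formula; it assumes Poisson counts and
# treats the bin width as exact. It breaks down for empty bins (count 0).
err <- dens * sqrt(1 / counts + 1 / n)

round(cbind(mids = breaks[-1] - width / 2, counts,
            density = dens, error = err), 5)
```

The recomputed `dens` should match the `$density` values listed above, and the `mids` column should match `$mids`.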