Uncertainty corresponding to bin population.


I have around 200 data points, which I have separated into 7 bins. Some bins contain up to 50 points, while others contain only around 3-6. Some of these 200 points belong to a group C. I want to plot a histogram of the ratio of the number of points in each bin that belong to C to the total number of points in that bin. This is fairly simple.

I also want to plot the error associated with each bin's population. I believe I can use the equation

$error=Z\sqrt{\frac{p(1-p)}{N}} $.

where $p=N_C/N$,

$N_C$ is the number of points in the bin that belong to C,

$N$ is the total number of points in the bin, and

$Z=1$ for a 1-sigma interval.

However, one of my bins has no points belonging to C, so $N_C = 0$, which gives $p=0$ and $error=0$. This does not make sense: should there not still be uncertainty for the bin even if $N_C = 0$? Here is an example of my histogram: [Example Histogram]. I have a feeling this equation makes assumptions that do not apply to my circumstances. If you know of a more general form of this equation, please let me know. I'm sorry if this is trivial; I am new to error analysis.
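For concreteness, here is a minimal sketch of the per-bin ratio and its 1-sigma binomial error (Python for illustration; the bin counts are made up, not the real data):

```python
import math

# Hypothetical per-bin counts (illustrative only -- not the real data):
# N is the total number of points in each bin, N_C the points belonging to C.
N   = [50, 42, 38, 30, 20, 14, 6]
N_C = [20, 15, 10,  8,  4,  0,  2]

for n, n_c in zip(N, N_C):
    p = n_c / n                         # ratio plotted in the histogram
    err = math.sqrt(p * (1 - p) / n)    # Z = 1, i.e. a 1-sigma error
    print(f"N={n:2d}  N_C={n_c:2d}  p={p:.3f}  error={err:.3f}")
```

The bin with $N_C = 0$ comes out with error $= 0$: the formula's normal approximation breaks down near $p = 0$ (and $p = 1$).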

2 Answers

The red error bars look impressive. However, what use do you expect readers will make of them? Error bars seem to encourage a kind of ad hoc analysis that can lead to false discovery, if not done carefully.

Maybe it's better to have a footnote saying that the heights of the bars are subject to random sampling error, so that differences in heights may not be meaningful unless they exceed roughly $\pm 0.2$ or $\pm 0.3$ on the vertical scale.


Second answer: In the first answer I suggested not using 'error bars' in your figure. Now it seems you want to use error bars and are focusing on the lack of an error bar for one category with a $0$ count.

When you get a $0$ count for something, there are at least two possibilities: (a) counts are possible, but you didn't get one (a fair coin tossed twice, with no Heads); (b) counts are impossible (hypothesized J-particles don't exist; there are no extraterrestrial students in this district).

This is a philosophical issue not really answered by CI formulas. Suppose you get a 0 count in 100 trials:

A Wald 95% CI is $(0,0),$ [Note: Wald intervals are based on an asymptotic argument. Their use has been deprecated unless the sample size is at least several hundred.]

p.hat = 0/100
CI = p.hat + qnorm(c(.025,.975))*sqrt(p.hat*(1-p.hat)/100); CI
[1] 0 0

An Agresti-Coull 95% CI (add 2 successes and 2 failures, so the adjusted sample size is $n + 4$) computes to $(-0.007, 0.046),$ interpreted as $(0, 0.046),$

p.est = (0+2)/(100+4)
CI = p.est + qnorm(c(.025,.975))*sqrt(p.est*(1-p.est)/(100+4)); round(CI, 6)
[1] -0.007164  0.045625

A Clopper-Pearson 95% CI, as implemented in the R procedure binom.test is $(0, 0.036).$ [The formula is somewhat intricate.]

binom.test(0, 100)$conf.int
[1] 0.00000000 0.03621669
attr(,"conf.level")
[1] 0.95
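As an aside, for a zero count the Clopper-Pearson interval has a simple closed form: the lower limit is $0$, and the upper limit solves $(1-p)^n = \alpha/2$, giving $p = 1 - (\alpha/2)^{1/n}$. A quick check (Python, for illustration):

```python
# Closed-form Clopper-Pearson upper limit for x = 0 successes in n trials.
alpha, n = 0.05, 100
upper = 1 - (alpha / 2) ** (1 / n)
print(round(upper, 8))   # 0.03621669, matching binom.test(0, 100) above
```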

A Jeffreys 95% CI is $(0.000005, 0.024745).$ [Based on a Bayesian argument, but also used as a frequentist CI.]

round(qbeta(c(.025,.975), .5,100.5),6)
[1] 0.000005 0.024745

Also, there are various one-sided CI methods that give an upper bound; see in particular the "rule of three".
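The rule of three says that when $0$ successes are observed in $n$ trials, an approximate one-sided 95% upper confidence bound for the proportion is $3/n$ (a quick Python check, for illustration):

```python
# Rule of three: approximate 95% upper bound for a proportion after a
# zero count in n trials.
n = 100
upper = 3 / n
print(upper)   # 0.03, in the same ballpark as the Clopper-Pearson 0.036
```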

See Wikipedia for more on confidence intervals for binomial proportions.