How to calculate the error margin on the measured number of occurred events when sampling

We are measuring the number of times a certain event happens. We do this by sampling, so that each event is reported only with probability p. For example, p=0.01 would result in about 1 in 100 events coming through. If we have measured N events, we assume that about N/p events actually occurred.

We would like to know the margin of error as a function of p and N, so that we can say that with 95% probability our result is within some margin of error. How do we calculate this? We are looking for a function that takes N and p as parameters.

Thanks for your help in advance!

There is 1 best solution below

I think this question fits better on Cross Validated (the statistics part of SE), but here is how I would deal with this issue:

The number of events, or count data, is commonly modeled with a Poisson distribution, which takes just one parameter, $\lambda$ (lambda). The mean and the variance of a Poisson distribution both equal $\lambda$.

So, for example, suppose you measured N over several periods and have a vector of Ns such as [1000, 1020, 1040, 970, 900]. In R, you can estimate $\lambda$ from such a vector like this:

library(MASS)
Ns <- c(1000, 1020, 1040, 970, 900)
params <- fitdistr(Ns, "poisson") # fit a Poisson distribution by maximum likelihood
lambda <- params$estimate # estimated lambda (the sample mean)
sd <- params$sd # standard error of the lambda estimate, sqrt(lambda / n)

Then, if your sample is large, you can get the lower and upper bounds of the 95% confidence interval for $\lambda$ by taking the mean and subtracting or adding 1.96 times the standard error reported by fitdistr (params$sd).

ci <- c(lambda + c(-1,1) * 1.96 * sd)
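As a cross-check that does not rely on the large-sample approximation, base R's `poisson.test` returns an exact interval for the rate. This is only a sketch, assuming the five counts are independent draws from the same Poisson distribution; the names `lambda_hat`, `se_hat`, `wald_ci` and `exact_ci` are illustrative and not from the original answer:

```r
Ns <- c(1000, 1020, 1040, 970, 900)

# Normal-approximation (Wald) interval computed by hand:
# the Poisson MLE of lambda is the sample mean, and its
# standard error is sqrt(lambda / n).
lambda_hat <- mean(Ns)
se_hat <- sqrt(lambda_hat / length(Ns))
wald_ci <- lambda_hat + c(-1, 1) * qnorm(0.975) * se_hat

# Exact interval: total count observed over length(Ns) periods.
exact_ci <- as.numeric(poisson.test(sum(Ns), T = length(Ns))$conf.int)
```

With counts this large the two intervals should nearly coincide; the exact interval matters more when the counts are small.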

Edited: forgot to mention, now that you have the 95% CI, you can answer your question by checking whether the value you are interested in falls within it.

As for the sampling rate, I'm not sure it really matters. If you know for sure that you sample every 100th value, just multiply the values in Ns by 100.
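For a single count, the same idea gives a function of N and p directly. A rough sketch, assuming the logged count N is itself Poisson distributed (which holds if the underlying events are Poisson and each one is kept independently with probability p), so that its standard deviation is about sqrt(N). `poisson_scaled_ci` is a hypothetical helper name, not part of any package:

```r
poisson_scaled_ci <- function(N, p, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)           # about 1.96 for conf = 0.95
  logged_ci <- N + c(-1, 1) * z * sqrt(N)  # interval for the logged count
  logged_ci / p                            # rescale to the full event count
}

ci_total <- poisson_scaled_ci(600, 0.01)
```

Dividing by p rescales both bounds, so the relative margin of error depends only on N: it is about 1.96 / sqrt(N).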

Edited: Maybe this helps also: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

p <- 0.01 # expected proportion of logged events
N <- 600 # number of logged events
Np <- N / p # estimated total number of events (number of trials)
# Each event is a Bernoulli trial: if it gives 1 we record it, if 0 we ignore it.
# N is then the number of successes ("heads").
# 95% confidence interval for the proportion of logged events p:
ci <- p + c(-1, 1) * 1.96 * sqrt(p * (1 - p) / Np)

In other words, given your N and p, this shows how much uncertainty there is about p.
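The same binomial view can also be turned around to give an interval for the total itself. If the true number of events M is treated as a fixed unknown and N as Binomial(M, p), the variance of N/p is M(1-p)/p, which can be estimated by N(1-p)/p^2. A minimal sketch under that assumption, using the normal approximation (so it is only reasonable when N is not small); `estimate_total` is a hypothetical name:

```r
estimate_total <- function(N, p, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)  # about 1.96 for conf = 0.95
  total <- N / p                  # point estimate of the true count
  se <- sqrt(N * (1 - p)) / p     # estimated standard error of N / p
  c(estimate = total, lower = total - z * se, upper = total + z * se)
}

res <- estimate_total(600, 0.01)
```

For small p the factor (1 - p) is close to 1, so this is nearly the same as a Poisson-based interval of N/p plus or minus 1.96 * sqrt(N) / p.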