Modelling uncertainties in sampling

I am seeking to complete/perfect a simulation program in Excel that models the following:

Step 1

Sample N items, each of which is in one of two states. It is handy to depict these two states as 1 and 0, respectively, so one can use Boolean algebra: P(1) = p and P(0) = 1-p.

So far so good: for each successive sample (1 to N) in my simulation I have coded the following, where p stands for the cell holding the probability:

=IF(RAND()<p,1,0)

and that produces results with means and standard deviations that conform very closely to what binomial theory predicts

$\sigma = \left(p(1-p)/N\right)^{0.5}$

as I vary values for p in the range 0% to 100%.
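As a cross-check outside Excel, the Step 1 logic can be sketched in Python. This is only a minimal sketch: the function names, the seed, and the run counts are assumptions for illustration, not part of my spreadsheet. It repeats the N-item sampling many times and compares the spread of the sample proportion with the binomial prediction above.

```python
import random
import statistics


def sample_proportion(n_items, p, rng):
    """Fraction of 1s among n_items draws, each a 1 with probability p."""
    # Mirrors the spreadsheet formula =IF(RAND()<p,1,0), averaged over N cells.
    return sum(1 if rng.random() < p else 0 for _ in range(n_items)) / n_items


def simulate_step1(n_items, p, n_runs, seed=0):
    """Return (mean, standard deviation) of the sample proportion over n_runs."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    proportions = [sample_proportion(n_items, p, rng) for _ in range(n_runs)]
    return statistics.mean(proportions), statistics.pstdev(proportions)


if __name__ == "__main__":
    n_items, p = 100, 0.3  # illustrative values only
    mean, sd = simulate_step1(n_items, p, n_runs=2000)
    predicted_sd = (p * (1 - p) / n_items) ** 0.5
    print(f"mean={mean:.4f} sd={sd:.4f} predicted sd={predicted_sd:.4f}")
```

The simulated standard deviation should land close to the predicted value, with the gap shrinking as the number of runs grows.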

I know that < should strictly be ≤, but the difference is inconsequential, with Excel generating random numbers between 0 and 1 to 10 decimal places.

Step 2

Model the uncertainty that the categorization of each sampled item as a 1 or a 0 cannot be made with 100% reliability. The probability that the sampler correctly identifies an item that is actually a 1 is only c (so the probability of mis-categorizing it as a 0 is 1-c), and the probability that the sampler correctly identifies an item that is actually a 0 is also c (so the probability of mis-categorizing it as a 1 is also 1-c).

My overall objective is to determine the overall standard deviation of possible outcomes with variability due both to sampling, and due to uncertainties in assessing the sampled items.

For each sampled item, where sampled stands for the cell holding its Step 1 value, I have coded:

=IF(RAND()<c,sampled,BITXOR(sampled,1))

I’m using Excel’s BITXOR function with the second argument fixed at 1 to flip the 0/1 value, i.e. to synthesise a logical NOT, because Excel’s NOT function returns TRUE/FALSE rather than 1/0.
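The same Step 2 logic can be sketched in Python to check the behaviour independently of the spreadsheet. The names `observe` and `simulate_step2` and the fixed seed are assumptions for illustration; the sketch also assumes, as in my formulas, that the mis-categorization draw is independent of the item's true state.

```python
import random


def observe(true_state, c, rng):
    """Return the recorded state: correct with probability c, flipped otherwise."""
    # Mirrors the spreadsheet's IF(RAND()<c, ...) step; 1 - x flips a 0/1 value,
    # which is what BITXOR(x, 1) does there.
    return true_state if rng.random() < c else 1 - true_state


def simulate_step2(n_items, p, c, seed=0):
    """Return (true proportion, observed proportion) for one sample of n_items."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    true_states = [1 if rng.random() < p else 0 for _ in range(n_items)]
    observed = [observe(state, c, rng) for state in true_states]
    return sum(true_states) / n_items, sum(observed) / n_items


if __name__ == "__main__":
    # Illustrative inputs only; vary p and c to compare the two proportions.
    true_p, observed_p = simulate_step2(n_items=100_000, p=0.2, c=0.9)
    print(f"true proportion={true_p:.4f} observed proportion={observed_p:.4f}")
```

Setting c to 1 makes the observed proportion equal the true one exactly, which is a quick sanity check that the flip step is wired up correctly.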

Results

With hundreds and thousands of samplings and determinations in my simulations, the variances are also plausible for the uncertain determinations (i.e. judgments whether 1 or 0 is applicable to the sampled items). However, in every simulation run, with various different levels of p and c as inputs, the determined average value of p (let’s call it p’) is closer to 50% than was p. This does not make sense to me. While I can appreciate that some adjustment in outcome must occur to reflect the imprecise determinations, my model’s skew seems to lack a symmetry I would expect.
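For reference, the expected value of p’ implied by the two formulas as coded can be written out directly. This is only a sketch of what the model as implemented implies, assuming the mis-categorization draw is independent of the item's true state; it does not by itself settle whether that is the right model. An item is recorded as a 1 either because it is a 1 and was identified correctly, or because it is a 0 and was mis-categorized:

$$
\begin{aligned}
E[p'] &= p\,c + (1-p)(1-c),\\
E[p'] - \tfrac{1}{2} &= (2c-1)\left(p - \tfrac{1}{2}\right).
\end{aligned}
$$

Because $|2c-1| < 1$ whenever $0 < c < 1$, the coded model gives an expected p’ nearer to 50% than p for any p other than 50%, which is at least consistent with what the simulation runs show.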

While I have described the status of 1 or 0 as being “actualities” versus imprecise determinations/judgments in the above, the status of 1 or 0 could just as easily be an imprecise perception in both cases (i.e. as if the truth cannot ever be known with certainty and we are comparing opinions/judgements), with my model modelling the extent of agreement, as well as (mean) values for p and p’. Under these latter circumstances, I can’t see why the mean value of p’ should necessarily be closer to 50% than the mean value for p.

Am I right, or what am I missing? Is there more than one mathematical solution to my problem that I am not including in my simulation, or is there something else wrong in my modelling?