First just a note that this isn't a textbook problem but rather a practical problem I'm trying to solve in the real world (not involving marbles, but the problem is essentially the same). That's why it's ridiculously convoluted, sorry.
The situation: every once in a while someone comes to me with a humongous bag containing a huge number of marbles (e.g. 10 million) of different colors, and they ask me to wrap up all the color varieties except one in individual little paper wrappers (one wrapper per marble). All they can tell me in advance is the total number of marbles in the bag, the list of colors in that bag (usually fewer than 10), and that they want me to get the wrapping correct for at least 98% of the marbles. It may not matter to the problem, but the color that's not getting wrapped is always much more common than the other colors. And to be clear, there's only one kind of wrapper; I'm only mentioning that there are more than 2 marble colors because it might affect wrapping accuracy (e.g. due to color blindness or how similar the unwrapped color is to the other colors).
The tools I have to help me are a color-detecting marble wrapping machine and a handful of human employees. In the past I've found human employees to be roughly 95% accurate at wrapping the right marbles on average, but of course any given group of people may be better or worse. The machine I aim to calibrate to some level of accuracy. The plan is to tackle this project in three stages:
- Calibration: I take a subset, M, of the marbles and have the machine try to wrap them. Human employees "grade" the work, and I keep tweaking the machine and re-wrapping the M marbles until I can't afford the expense any more. After that work, I've managed to make the machine X% accurate on the given subset of marbles. X is always less than the 98% accuracy I need to deliver. After this calibration is done I set these M marbles aside as finished. First question: how big should M be as a function of the total number of marbles, so that I can be, say, at least 90% sure the accuracy I'm seeing is representative of how the machine will perform on the total population of marbles?
- Machine Wrapping: I have the machine take a pass at wrapping the remainder of the marbles. Because I know from the calibration stage that the machine accuracy is lower than I need, I have my human employees take a pass at a subset, M2, of the machine-processed marbles, correcting any wrapping mistakes they see. The idea is that with the machine pass and the human pass I hit my 98% accuracy goal overall. Second question: how big should M2 be in order to be 90% confident that I hit the 98% accuracy goal, assuming the human employees are 95% accurate?
- Quality Assurance: As I mentioned above, though, humans aren't always reliable, or rather they are unreliable at different rates. To help verify for the customer that the employees involved this time weren't more inaccurate than expected, I bring in a group of veteran employees to review a subset, M3, of all the marbles the other employees themselves reviewed. If it's easier we can assume the veterans are 100% accurate, though of course that's not strictly speaking true. If it turns out the employees were less accurate than I was expecting, I calculate an additional subset, M4, of the remaining machine-wrapped-only marbles for them to review so extra volume can make up the difference. Third question: how big should M3 be in order to be 90% confident that I've estimated their actual accuracy? Fourth question: how big should M4 be as a function of the employees' estimated accuracy to be 90% confident I've increased the total wrapping accuracy to the desired 98%?
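To make the first and third questions concrete, here is a minimal sketch of the kind of calculation I have in mind: the textbook sample size for estimating a proportion, with a finite population correction. It assumes the sample is drawn at random and uses the normal approximation. The function name `sample_size`, the margin of error (how far off the estimate may be) and the worst-case guess `p_guess=0.5` are my own assumptions for illustration, not part of the problem as given.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.90, margin=0.01, p_guess=0.5):
    """Sample size needed to estimate an accuracy rate to within +/- margin.

    Assumptions (illustrative only): simple random sampling, normal
    approximation, and p_guess as the anticipated accuracy (0.5 is the
    most conservative choice).
    """
    # Two-sided critical value, e.g. about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p_guess * (1 - p_guess) / margin ** 2
    # Finite population correction shrinks n0 when the bag is small.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    for total in (10_000, 1_000_000, 10_000_000):
        print(total, sample_size(total))
```

Under these assumptions the required sample barely grows once the bag is large, so the margin of error I'm willing to accept would matter much more than the total marble count.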
To summarize: giant bag of marbles in a few different colors; I know the total marble count and which colors are present. I need to wrap all the colors but one with 98% accuracy overall. I have a machine that can mostly do the work, but, even after I tweak it, it's not quite accurate enough. My plan is to use human employees to make up the difference (and then other veteran employees to double check some of their work), but these folks also aren't perfectly accurate. What I'm looking for is a general purpose statistical algorithm/equation/plan (I won't say "the best" solution because that'd be an opinion) I can use to figure out the different marble samples to use above. Or, if there is a sampling plan with fewer steps (fewer sampling steps and/or fewer calculation steps), I'd love to hear that as well. Again, hopefully it's okay for this site that I'm looking for "a" solution, not "the best" solution. I could probably break this post into multiple different questions, but I think it's maybe easier to understand all together.
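For the second and fourth questions, a rough sketch of the arithmetic under one simplifying model, which is an assumption on my part: a human reviewer fixes a machine mistake with probability `catch_rate` and never breaks a marble the machine got right. Reviewing a fraction f of the machine-wrapped marbles then gives an expected accuracy of machine_acc + f * (1 - machine_acc) * catch_rate. The function name `review_fraction` is hypothetical, and this only covers the expected accuracy, not the 90% confidence part.

```python
def review_fraction(machine_acc, catch_rate, target=0.98):
    """Fraction of machine-wrapped marbles humans must review.

    Assumed model (illustrative only): reviewers fix a machine error with
    probability catch_rate and never spoil a correctly wrapped marble.
    A result above 1 means the target is out of reach under this model.
    """
    if not 0 <= machine_acc < 1:
        raise ValueError("machine_acc must be in [0, 1)")
    if catch_rate <= 0:
        raise ValueError("catch_rate must be positive")
    if machine_acc >= target:
        return 0.0  # the machine alone already meets the target
    # Expected accuracy gained per reviewed marble.
    gain_per_reviewed = (1 - machine_acc) * catch_rate
    return (target - machine_acc) / gain_per_reviewed


if __name__ == "__main__":
    planned = review_fraction(0.96, 0.95)   # plan using the assumed 95%
    revised = review_fraction(0.96, 0.90)   # veterans estimate only 90%
    print(planned, revised, revised - planned)
```

In this sketch the difference between the revised and planned fractions plays the role of M4, expressed as a share of the machine-wrapped marbles. If instead the reviewers' own decisions are only 95% accurate regardless of what the machine did, reviewed marbles would top out at 95% and no amount of review could reach 98%, so which model applies matters a lot.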