I'm working on an experimental consensus algorithm for a blockchain oracle service (trying to learn about it, at least) and I have a simple question: is a certain property even calculable or not? I'm not thaaat math savvy actually, but sometimes I think I am. :D
So... we have lots of smarties, red and green ones. :) Around 49% of them are red and 51% are green. I wanna pick 3 of the same color in a row. If I pick one of a different color, I eat them all and try again. So there's a certain probability of achieving that, but what I wanna know is... is there a way to mathematically approximate the number of smarties I will need to eat until I succeed? For each color separately?
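To make the picking rule concrete, here is a stripped-down sketch of it (draws with replacement, i.e. the color fractions stay fixed; the function and parameter names are just illustrative, this is not the code from my repo):

```javascript
// Stripped-down sketch of the picking rule described above
// (draws with replacement, so the color fractions stay fixed).
function eatenUntilRun(pGreen, runLength) {
  let eaten = 0;
  while (true) {
    const firstIsGreen = Math.random() < pGreen; // start a fresh attempt
    eaten++;
    let run = 1;
    while (run < runLength) {
      const isGreen = Math.random() < pGreen;
      eaten++; // the mismatching smartie gets eaten too
      if (isGreen !== firstIsGreen) break; // mismatch: eat them all, try again
      run++;
    }
    if (run === runLength) return eaten; // runLength same-colored in a row
  }
}

// Average over many requests, like the simulation output below.
function averageEaten(pGreen, runLength, requests) {
  let total = 0;
  for (let i = 0; i < requests; i++) {
    total += eatenUntilRun(pGreen, runLength);
  }
  return total / requests;
}
```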
I have a simulation of that and it shows me an average value of how many "smarties" it ate.
Corrupted oracles: 33% (3300/10000)
Confirmations needed per request: 10
---
Data Requests: 1000
Involved oracles per request (avg): 188.49 (x18.85 of minimum)
Correct consensus: 999 (99.90%)
False consensus: 1 (0.10%)
No consensus: 0 (0.00%)
The corrupted oracles are our red smarties here. The simulation assumes 33% out of 10k overall. It tries to pick 10 "same-colored smarties" in a row and repeats that 1000 times. In this example it needed to eat 188.49 smarties on average until it had eaten 10 same-colored ones in a row.
Question: Can this number somehow be calculated based on for example 33% red and 10 same ones in a row?
Additional question: Can I calculate the number of corrupted oracles/red smarties with that number?
While I hope for enlightening answers I'll edit this and explain a bit more...
Ok so... As I said, I am playing around with blockchain oracles. Just wanna learn about it. But that doesn't matter anyway. Basically I want to amplify a given distribution, so that the chances of hitting the sliiightly bigger chunk become much higher than, for example, 49/51. I'm doing that by simply chaining the probabilities, and I realized that this changes the relation. And I actually tested my script with a 49% corruption value, et voilà:
Corrupted oracles: 49% (4900/10000)
Confirmations needed per request: 10
---
Data Requests: 1000
Involved oracles per request (avg): 1496.85 (x149.69 of minimum)
Correct consensus: 609 (60.90%)
False consensus: 390 (39.00%)
No consensus: 1 (0.10%)
Still ~60% correct consensus. BUT also ~40% false consensus: we accidentally ate 10 red smarties in a row. So given the fact that when all smarties are eaten the result is simply "No consensus" (a safe failure), there must be a way to (statistically) force the smaller side into "No consensus" by setting the right maximum number of involved oracles (smarties).
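What I mean by "amplifying" can be sketched numerically. If each request ends with a run of r identical answers, then (assuming fixed color fractions and my restart-on-mismatch rule; this is illustrative code, not from my repo) the correct/false odds scale like (p/q)^r instead of p/q:

```javascript
// Chaining confirmations amplifies the majority: a single oracle
// favors the honest side by p/q, but a run of r identical answers
// favors it by (p/q)^r.
function consensusSplit(pGreen, runLength) {
  const pRed = 1 - pGreen;
  const green = Math.pow(pGreen, runLength); // weight of an all-green run
  const red = Math.pow(pRed, runLength);     // weight of an all-red run
  return {
    correctRate: green / (green + red),
    falseRate: red / (green + red),
  };
}
```

For 51% green and 10 confirmations this gives roughly a 60/40 split of correct vs. false consensus, and for 67/33 the false-consensus rate drops below 0.1%, which lines up with the simulation runs above.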
Does that make sense or am I going in circles here?
Simulation script can be found here: https://github.com/mktcode/simplor-node
Update
Corrupted oracles: 33% (3300/10000)
Confirmations needed per request: 10
---
Data Requests: 1000
Involved oracles per request (good): 189.51 (x18.95 of minimum)
Involved oracles per request (bad): 898.00 (x89.80 of minimum)
Correct consensus: 999 (99.90%)
False consensus: 1 (0.10%)
No consensus: 0 (0.00%)
I split up "Involved oracles per request" into good and bad ones now. Here you can clearly see that the 1 false consensus needed 898 oracles, while the correct ones only needed 189.51 on average. That is the very significant difference that results from chaining the probabilities, even when the network is up to 49% corrupt, and I'm wondering whether that might be an interesting property.
I started out writing this as a comment, but it got too long.
Suppose that the probability that all the smarties in a sample are the same color is $p$. If you put the smarties back when they're not all the same color, then the expected number of trials until you get a sample that's all the same color is $\frac1p$. (This is what I meant by sampling with replacement and a geometric distribution.)
If you don't put them back, then calculating the exact expectation is laborious, and I would recommend simulation instead. However, it may be that the simple calculation described in the first paragraph is good enough for your needs. Suppose, in the initial population, half the smarties are green and half are red. If you randomly choose $10$ smarties, the probability that they are all the same color is $\frac1{512}$, and it will take, on average, $512$ trials to get a monochrome sample. If you were sampling without replacement, you would have removed $5120$ smarties from the jar.
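In code, the with-replacement batch calculation from the paragraph above looks like this (the names are mine, for illustration only):

```javascript
// The batch model described above: draw `batchSize` smarties at once,
// with replacement; if the batch is monochrome you succeed, otherwise
// you try again, so the number of trials is geometric.
function batchModel(pGreen, batchSize) {
  const pMono =
    Math.pow(pGreen, batchSize) + Math.pow(1 - pGreen, batchSize);
  return {
    pMono,                            // probability one batch is all one color
    expectedTrials: 1 / pMono,        // mean of the geometric distribution
    expectedDrawn: batchSize / pMono, // smarties drawn across all trials
  };
}
```

With `pGreen = 0.5` and batches of $10$ this reproduces the numbers above: a monochrome probability of $\frac1{512}$, $512$ expected trials, and $5120$ smarties drawn.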
If you started out with $20,000$ smarties, removing $5120$ of them might have a material effect on the distribution of colors in the population. If you started out with a million smarties, not so much. This is what I was getting at in my comment. If your population is large, you may be able to get an answer that is close enough for your purposes by ignoring the effect of removing the smarties, and assuming instead that they are replaced.
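To see when ignoring the removal is safe, you can compare the exact without-replacement probability (a product of shrinking fractions) with the with-replacement approximation (a sketch; the function name is mine):

```javascript
// Exact probability that k draws are all green when sampling WITHOUT
// replacement from N smarties of which G are green: each draw shrinks
// both the green count and the jar.
function pAllGreenNoReplace(N, G, k) {
  let p = 1;
  for (let i = 0; i < k; i++) {
    p *= (G - i) / (N - i);
  }
  return p;
}
```

For a tiny jar ($N = 20$, half green) this is far below the with-replacement value $\left(\frac12\right)^{10}$, but for a million smarties the two agree to within about $0.005\%$, which is the "close enough" regime described above.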
I'm not a JavaScript programmer, so I'm not sure I understand your code, but in fact I do not see where your simulation removes the sampled items.