i'm doing a statistical analysis of a well-known cryptographic algorithm and have hit an anomaly. i need to prove that what i have found is statistically significant.
i am taking block sizes of 256 bits, and, assuming a binomial distribution (that occurrence of each bit being 0 or 1 is probability 0.5 and that all bits are independent) am adding up the number of zeros (which should on average be 128) and calculating a p-value. repeat for ten million blocks of 256 bits and a histogram can be built up.
the problem is that the p-values are discrete, not uniform, so the usual trick of creating say 10 buckets in increments of 0.1 does not work. also, the problem is that it is only the p-values below 1e-6 where the problem lies.
i have a table here which is ten million p-values from blocks of size 256 bits.
0.000000004 1
0.000000077 1
0.000000152 1
0.000000298 1
0.000000573 3
0.000001088 4
0.000002034 13
0.000003746 14
0.000006795 30
0.000012143 60
0.000021377 113
0.000037073 177
0.000063342 355
0.000106625 540
0.000176835 827
0.000288961 1300
0.000465258 2112
0.000738157 3321
0.001154050 5168
0.001778051 7472
0.002699796 11086
0.004040275 15609
0.005959526 22666
0.008664897 31501
0.012419331 44040
0.017548950 59411
0.024448945 79218
0.033586613 104171
0.045500264 133922
0.060792724 171872
0.080118314 216864
0.104162559 266307
0.133614403 324626
0.169131445 386622
0.211299547 457682
0.260589034 530572
0.317310508 604553
0.381573906 680090
0.453254705 754112
0.531971058 820481
0.617075077 880667
0.707660467 929469
0.802587349 965874
0.900523550 988681
1.000000000 498391
what i need to do is prove that one particular p-value histogram entry (which will represent the number of blocks with an exact number of 0s occurred) is a statistical anomaly.
i have another table of "reasonably known good" data, and there are no entries below 0.000000573. so i definitely "suspect" that something is wrong: now i need a probability of that "wrongness" occurring.
i thought about creating a "perfect" histogram (which should be feasible because it is a binomial distribution, you know the block size and the number of runs) then comparing against that, but i would not know how to do the comparison.
any help greatly appreciated.
l.
update: ok i have located a way to calculate a binomial distribution and have implemented it in c-code:
double binomial_pmf(int k, int n, long double p)
{
long double x = factorial(n) / (factorial(k) * factorial(n-k));
return x * powl(p, k) * powl(1-p, n-k);
}
now i should be able to create the "perfect histogram" by taking the function which calculates the p-value for each k from 1 to n-1 (n is the sample size), and multiplying by binomial_pmf times number-of-blocks (ten million).
so my question becomes: how do i make a meaningful comparison of these two histograms?