Normalising Low Duty-Cycle Random Data


I'm an engineer by training, and I'm developing a quantum random number generator for high-security cryptographic applications.

I'm using single-photon arrival times as my source of entropy for the device; the arrivals follow a Poisson process.

However, this means that my input data takes the form of around 30,000 pulses per second per channel, each about 15 ns wide. Obviously, if I just sampled my input lines every 5 ns, say, I'd end up with random data with an extreme bias towards 0.

Consequently, my thought is to keep a rolling average of the last X inter-arrival times and output a 0 if the next interval is shorter than the average and a 1 if it's longer. Intuitively, I feel as though this should normalise the data and remove the duty-cycle bias.

However, I have no idea how to go about proving this statistically.

Could you please point me in the right direction, either on how to prove this or towards a "standard" way to perform this kind of normalisation (ideally something that can be implemented easily in digital circuitry)?

Failing this, I'll just have to collect a few gigabytes of data and count the 1s and 0s to see if there's much of a bias; I'd much rather have a proof, though.

Note: An approach that mathematically demonstrates that this assumption is true to a reasonable degree of accuracy is perfectly fine if a formal proof is far too complicated.

Best answer:

First, about the mean and the median. Assuming that the arrivals form a true Poisson process (which is quite likely given your description, though perhaps I'm missing some details), the intervals between the arrivals are independent random variables with the density $\lambda e^{-\lambda t}$, where $\lambda$ is the intensity of the process. The mean is exactly $1/\lambda$, and that is pretty much what you get by averaging sufficiently many values (we can discuss how many are "sufficiently many" later if you're interested). However, the probability of being above the mean is $\lambda\int_{1/\lambda}^{+\infty}e^{-\lambda t}\,dt=1/e$, not $1/2$, so you'll have a quite strong bias (almost $2:1$) in favor of short times if you use the mean as the cutoff.

The median (i.e., the cutoff at which you have equal probabilities to be above and below) is (from the same equation) $\frac{\log 2}{\lambda}$, i.e., about $0.693$ times the mean (so you can use the average to estimate it but you need to multiply by the factor above).
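To illustrate, here is a quick simulation sketch (with a hypothetical rate $\lambda=1$ in arbitrary units; Python just for convenience):

```python
import math
import random

# Empirical check: for exponential inter-arrival times, the fraction of
# intervals above the mean 1/lam is 1/e, while the fraction above the
# median (log 2)/lam is 1/2.  (lam = 1.0 is an arbitrary choice.)
random.seed(1)
lam = 1.0
samples = [random.expovariate(lam) for _ in range(1_000_000)]

above_mean = sum(t > 1 / lam for t in samples) / len(samples)
above_median = sum(t > math.log(2) / lam for t in samples) / len(samples)

print(above_mean)    # ≈ 0.368 (that is, 1/e), not 0.5
print(above_median)  # ≈ 0.5
```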

There are other ways, however, to extract independent zeroes and ones out of the Poisson process. One very simple way is to use 2 consecutive time intervals between arrivals and output $0$ if the first is longer and $1$ otherwise. The disadvantage is that you now use 2 arrivals to produce a single bit.
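A sketch of this two-interval extractor (again assuming exponential inter-arrival times; note that the output is fair by symmetry regardless of $\lambda$, so no estimate of the rate is needed):

```python
import random

# Pair up consecutive inter-arrival intervals and emit 0 if the first
# of the pair is longer, 1 otherwise.  For i.i.d. continuous intervals
# each outcome has probability 1/2 by symmetry, whatever the rate is.
random.seed(2)
lam = 3.7  # deliberately arbitrary: the method does not depend on it

intervals = [random.expovariate(lam) for _ in range(200_000)]
bits = [0 if a > b else 1 for a, b in zip(intervals[::2], intervals[1::2])]

ones = sum(bits) / len(bits)
print(ones)  # ≈ 0.5
```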

Another way is to look at consecutive time intervals of length $\frac{\log 2}{\lambda}$ and output $0$ if there is no arrival during the current time interval and $1$ if there is at least one arrival. That uses about $0.7$ arrivals per bit.
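A sketch of the fixed-window scheme, simulating a Poisson process by summing exponential gaps (the rate `lam` and the simulated duration are arbitrary choices):

```python
import math
import random

# Slice time into windows of length (log 2)/lam and emit 1 if a window
# contains at least one arrival, 0 otherwise.  P(no arrival in a window)
# = exp(-lam * log(2)/lam) = 1/2, so the bits are fair and independent.
random.seed(3)
lam = 2.5
window = math.log(2) / lam

# Arrival times of a Poisson process: cumulative sums of exponential gaps.
t, arrivals = 0.0, []
while t < 50_000.0:
    t += random.expovariate(lam)
    arrivals.append(t)

n_windows = int(arrivals[-1] / window)
occupied = [False] * n_windows
for t in arrivals:
    i = int(t / window)
    if i < n_windows:
        occupied[i] = True

frac_ones = sum(occupied) / n_windows
print(frac_ones)  # ≈ 0.5
```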

As before, you can use the average of several last observations to estimate $1/\lambda$ if you suspect that the intensity changes with time but believe that it does so slowly enough.

Edit: about the number of observations needed to estimate $1/\lambda$.

The standard deviation of the exponential distribution (the square root of the variance) equals the mean, $1/\lambda$. So if you take the average of $X$ observations with $X$ not too small, it will approximately follow the normal law with mean $1/\lambda$ (the one you want) and standard deviation $\frac{1}{\sqrt X}\frac 1{\lambda}$. The rule of thumb is that it is quite typical to land outside one standard deviation from the mean (the exact chance is about $30\%$) but rather unlikely to land outside three standard deviations (the corresponding chance is about $0.3\%$).

So if you just take $X=1024$ samples once and use the resulting estimate of $\lambda$ for all subsequent observations (presuming it does not change with time), you may expect the value to be about $\frac 1{32}\approx 3\%$ off, but not more than $9\%$ off, and you'll have the same relative error in the median. Since the density of the exponential distribution near the median is about $1/2$ (which is quite convenient), this translates directly into the relative error in the probability of getting $0$ or $1$. So you shouldn't be surprised if in this scheme you get the probabilities of $0$ and $1$ at $51.5\%$ and $48.5\%$ (or vice versa) instead of the expected $50$-$50$ ($1.5$ is $3\%$ of $50$), but getting one of them above $54.5\%$ and the other below $45.5\%$ in the very long run would mean that something is wrong either with the model or with the measurements.

If you re-evaluate $1/\lambda$ on the go (which makes sense if you suspect that $\lambda$ changes with time due to temperature variations or other physical phenomena), then you'll be pretty close to $1/2$ in the long run, because positive errors will essentially cancel negative ones. However, this $3\%$ bias will still manifest itself locally: on every short stretch you'll have it in some direction, which can also be viewed as a slight dependence between nearby bits. That is, if you try to predict the next bit in some reasonable way (say, by the majority of the previous few bits, relying on the fact that your coin at every moment is slightly biased, though the bias changes with time), you'll be right a bit more often than not. I'll skip the exact computation, but it looks like it gives the same $51$-$52\%$ range for correct guessing.

If this $3\%$ relative error is acceptable, then $1024$ samples are fine. If not, just increase the number: $X$ samples give a standard deviation of $1/\sqrt X$ times the mean, i.e., a relative error of $1/\sqrt X$.
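A quick check of this $1/\sqrt X$ rule (hypothetical rate, arbitrary units):

```python
import random
import statistics

# Estimate 1/lam by averaging X exponential intervals, many times over,
# and look at the spread of the estimates: the relative standard
# deviation should come out close to 1/sqrt(X).
random.seed(4)
lam, X, trials = 1.0, 1024, 2000

rel_errors = []
for _ in range(trials):
    avg = sum(random.expovariate(lam) for _ in range(X)) / X
    rel_errors.append(avg * lam - 1.0)  # relative error of the estimate

rel_sd = statistics.stdev(rel_errors)
print(rel_sd)  # ≈ 1/sqrt(1024) ≈ 0.031
```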

Again, always remember that the computations in any particular model are only as meaningful as the model itself, so always check against experiment in case we've missed some phenomenon.