Monte Carlo estimator of the number of 1's in a very long binary sequence


Preface

The question below is related to a problem I am working on, which requires counting the number of times a logic-valued function evaluates "TRUE" given an input value. The size of my input set is HUGE and evaluating the function is rather costly. However, I can think of this statistically by viewing the set of values generated by applying the function to each input value as a population. I would like to study this population statistically by evaluating a (relatively) small subset of the input values and using that information to infer a plausible range for the total number of times it would have evaluated true had I actually evaluated each input.

Therefore, I chose to conceptualize this as inferring the total number of 1's in a very long binary string via Monte Carlo. This conceptualization contains all the salient features of my problem.

Problem Statement

Given a long (but finite) binary string $S$ of length $n_S \gg 2^{20}$, I would like to use a Monte Carlo sampling approach to estimate the number of 1's in $S$.

Probabilistically, if I let each digit in $S$ have an equal chance of being selected by the Monte Carlo sampler, then the sampled digits can be thought of as outcomes of a random variable $X$. Therefore, the goal is to estimate $\mu = E[X]$, the expected value of a uniformly selected digit, via Monte Carlo. Since we know the size of $S$, our estimate of the number of 1's in $S$ will be $\mu n_S$. However, there is no explicit formula for calculating $\mu$ apart from brute-force enumeration.
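For concreteness, the naive approach can be sketched as follows (a minimal Python sketch; the sequence, sample count, and function names are all illustrative):

```python
import random

def naive_mc_count(S, n_samples, seed=0):
    """Estimate the number of 1's in S by uniform random sampling.

    S can be any indexable 0/1 sequence; all names are illustrative.
    Digits are drawn with replacement, so n_samples may exceed len(S).
    """
    rng = random.Random(seed)
    n_S = len(S)
    hits = sum(int(S[rng.randrange(n_S)]) for _ in range(n_samples))
    mu_hat = hits / n_samples      # estimate of the mean digit value
    return mu_hat * n_S            # scale by the known sequence length

# Example: a sequence with exactly 250 ones out of 1000 digits.
S = [1] * 250 + [0] * 750
est = naive_mc_count(S, n_samples=10_000)
```

The standard error of this estimator scales as $n_S\sqrt{\mu(1-\mu)/N}$ for $N$ samples, which is exactly the inefficiency the question is about.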

For very large sets, enumeration is not feasible and the basic/naive Monte Carlo approach is very inefficient. I was thinking that some form of adaptive stratification would help, with adaptation based on the sample average of each stratum.

Any suggestions or references to relevant literature would be helpful.


There are 2 best solutions below

BEST ANSWER

I propose the following:

Let $T$ be the estimated time of computation, and assume we want a precision such that the variance is fixed at a certain value $V$.

The size of the input is $n_S$; let us split it into $m$ parts of equal size $n'_S$.

On the $k^{\text{th}}$ subpopulation, estimate the proportion $p_k$ with a naive estimator considering only the first $MC^k$ digits: $$ \hat{p_k}^{MC^k} =\frac1{MC^k} \sum_{a=1}^{MC^k} 1_{a^{th}\text{ digit = 1}} $$

You have an estimate of the variance of $\hat{p_k}^{MC^k}$, assuming no correlation between entries, given by $$ V_k^{MC^k}=\frac 1{{MC}^k} \hat{p_k}^{MC^k}(1-\hat{p_k}^{MC^k}) $$
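As a quick numerical check of the two formulas above (a sketch; `digits` stands in for the first $MC^k$ sampled digits of one subpopulation, and the names are illustrative):

```python
def stratum_estimate(digits):
    """Proportion estimate and its variance estimate for one stratum.

    `digits` plays the role of the first MC^k sampled digits (0/1)
    of the k-th subpopulation; the name is illustrative.
    """
    mc = len(digits)
    p_hat = sum(digits) / mc          # \hat{p}_k
    v_hat = p_hat * (1 - p_hat) / mc  # V_k, assuming uncorrelated entries
    return p_hat, v_hat

# Example: 8 sampled digits, 4 of them ones.
p_hat, v_hat = stratum_estimate([1, 0, 1, 1, 0, 0, 1, 0])
# p_hat = 0.5, v_hat = 0.5 * 0.5 / 8 = 0.03125
```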

The whole estimator is $$\hat p(MC^1,\ldots,MC^m)= \frac 1{n_S} \sum n'_S\hat{p_k}^{MC^k} = \frac 1m\sum \hat{p_k}^{MC^k} $$

and an estimate of the variance, assuming the subpopulations are uncorrelated, is $$ V(MC^1,\ldots,MC^m)= \frac 1{m^2} \sum \frac 1{{MC}^k}\hat{p_k}^{MC^k}(1-\hat{p_k}^{MC^k}) $$

Now, minimizing $T = \sum MC^k$ under the constraint $ V(MC^1,\ldots, MC^m) = V $ gives the heuristic solution $$MC^k\propto \sqrt{\hat{p_k}^{MC^k}(1-\hat{p_k}^{MC^k})} $$

So here is the algorithm you could try:

  • split your population
  • use a small fraction of your computation time to get a first estimate of each $V_k^{MC^k}$
  • allocate the rest of your computation time according to the previous equation
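The three steps above can be sketched end-to-end (a Python sketch under simplifying assumptions: contiguous equal-size strata, a fixed pilot sample per stratum, a small per-stratum floor on the final allocation, and illustrative names throughout):

```python
import random

def adaptive_stratified_count(S, m, pilot, budget, seed=0):
    """Two-stage stratified estimator following the recipe above.

    Simplifying assumptions: S is split into m contiguous equal parts,
    each stratum gets a fixed pilot sample of size `pilot`, and the
    remaining budget is allocated proportionally to sqrt(p_k (1 - p_k)).
    All names are illustrative.
    """
    rng = random.Random(seed)
    n_S = len(S)
    part = n_S // m  # assume m divides n_S
    strata = [S[k * part:(k + 1) * part] for k in range(m)]

    def sample_mean(sub, n):
        # Uniform sampling with replacement within one stratum.
        return sum(int(sub[rng.randrange(part)]) for _ in range(n)) / n

    # Stage 1: pilot estimates of each stratum's proportion.
    pilots = [sample_mean(sub, pilot) for sub in strata]

    # Stage 2: allocate the remaining budget proportionally to
    # sqrt(p(1-p)); keep a small floor so no stratum is starved
    # (rounding may make the total deviate slightly from the budget).
    weights = [max((p * (1 - p)) ** 0.5, 1e-6) for p in pilots]
    remaining = budget - m * pilot
    budgets = [max(25, round(remaining * w / sum(weights))) for w in weights]

    # Combine: strata have equal sizes, so the overall proportion
    # is the plain mean of the per-stratum estimates.
    p_hat = sum(sample_mean(sub, b) for sub, b in zip(strata, budgets)) / m
    return p_hat * n_S

# Example: three strata with very different densities of 1's
# (the true total count of 1's is 900 + 0 + 50 = 950).
S = [1] * 900 + [0] * 100 + [0] * 1000 + [1] * 50 + [0] * 950
est = adaptive_stratified_count(S, m=3, pilot=100, budget=3000)
```

The allocation step concentrates samples in the high-variance strata (those with $\hat p_k$ near $1/2$), while nearly homogeneous strata receive only the floor.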

You should get the theoretical variance

$$ V(MC^1,\ldots,MC^m)\propto \frac 1{m^2} \sum \sqrt{\hat{p_k}^{MC^k}(1-\hat{p_k}^{MC^k})} \\ \simeq \frac 1{m} \sqrt{p(1-p)} $$ which is a reduction by a factor of $m$ compared to the variance of the naive estimator.


The following two recent articles may also be of interest:

Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget: This paper reduces the computational budget of Markov chain Monte Carlo (MCMC) sampling through an adaptation of the Metropolis-Hastings (MH) algorithm: sequential hypothesis tests accept or reject a proposed sample with high confidence based on only a small fraction of the data.

An alternative approach is provided by An Adaptive Subsampling Approach for MCMC Inference in Large Datasets, which is presented as more robust than the technique above, at the cost of increased computation.