Approaching a desired but infeasible distribution when constructing a sample


Suppose you have $N$ balls in $C$ different colors, and a "desired" distribution of those $C$ colors (e.g. 20% red, 80% blue). Your task is to build a sample (not really a random sample per se) of $S$ balls, with colors distributed "as close as possible" to the desired distribution.

Bear in mind that it will likely not be possible to hit the desired distribution exactly, since there may be too few balls available of a particular color. Some other colors will then need to be over-sampled to reach the desired sample size. But which colors should be over-sampled, and by how much? And what is a constructive process for creating the sample?

I assume the answer depends on how you measure "as close as possible", but any reasonable definition will suit me. I'm mostly interested in a constructive algorithm for the process.

I understand that I basically just need to determine a feasible distribution that is "as close as possible" to my desired (but possibly infeasible) distribution. My best guess is to choose one ball at a time, each time picking an available color that yields a resulting distribution of colors that differs least from the desired distribution. This is at best inefficient, and I'm not 100% sure it gives me an optimal solution anyway.

Sorry if this is a common statistical exercise. I did try to find this problem in the literature, but I did not know the proper vocabulary to search for it.

I should mention that I consider this sampling, because I am asked to generate this sample for a tax audit, where "balls" are invoices, and "colors" are invoice types. The sample should be random, but end up with the specified percentage of each type where possible, and "as close as possible" where not possible.

Best answer

You can adapt the Sainte-Laguë proportional representation method to your problem.

Suppose you start with $n_i$ balls of colour $i$ so $\sum_i n_i=N$ and your target proportions are $p_i$ with $\sum_i p_i=1$.

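For each $i$, calculate the quotients $\dfrac{p_i}{1}, \dfrac{p_i}{3}, \dfrac{p_i}{5}, \ldots ,\dfrac{p_i}{\min(2n_i-1,2S-1)}$. Stopping at the divisor $\min(2n_i-1,2S-1)$ means colour $i$ contributes at most $\min(n_i,S)$ quotients, so it can never be allocated more balls than are available.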

Take the $S$ largest quotients from all those you have calculated (if there is a tie for the last places, make an arbitrary choice, such as preferring the quotients with the larger $p_i$) and count how many of the chosen quotients came from each $i$. Call this count $s_i$, so that $\sum_i s_i=S$. This $s_i$ is your sample size for colour $i$.
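The steps above can be sketched in Python as follows. This is only an illustration under some assumptions: the function name `sainte_lague_allocation` and the dictionary-based inputs are made up for the example, ties are broken by larger $p_i$ and then by colour name, and plain floating-point quotients are used (exact fractions may be preferable if ties matter). Because each colour's quotients decrease, repeatedly taking the largest remaining quotient from a heap selects the same $S$ quotients as sorting them all.

```python
import heapq


def sainte_lague_allocation(targets, available, sample_size):
    """Return a dict colour -> s_i using capped Sainte-Lague quotients.

    targets:     dict colour -> target proportion p_i (hypothetical input format)
    available:   dict colour -> number of balls n_i of that colour
    sample_size: total number of balls S to allocate
    """
    total_available = sum(available.get(colour, 0) for colour in targets)
    if sample_size > total_available:
        raise ValueError("sample_size exceeds the number of balls available")

    counts = {colour: 0 for colour in targets}

    # heapq is a min-heap, so store negated values to pop the largest quotient.
    # Entry: (-quotient, -p_i, colour); -p_i breaks ties in favour of larger p_i.
    heap = [
        (-p, -p, colour)
        for colour, p in targets.items()
        if available.get(colour, 0) > 0
    ]
    heapq.heapify(heap)

    for _ in range(sample_size):
        _, neg_p, colour = heapq.heappop(heap)
        counts[colour] += 1
        # Offer the colour's next quotient p_i / (2 s_i + 1) only while
        # unallocated balls of that colour remain.
        if counts[colour] < available[colour]:
            divisor = 2 * counts[colour] + 1
            heapq.heappush(heap, (neg_p / divisor, neg_p, colour))

    return counts


# Example: the target is 20% red / 80% blue, but only 10 blue balls exist.
example = sainte_lague_allocation(
    {"red": 0.2, "blue": 0.8}, {"red": 50, "blue": 10}, 20
)
print(example)  # {'red': 10, 'blue': 10}
```

In the example, blue is exhausted at 10 balls, so the remaining 10 places go to red. Once the counts $s_i$ are known, the actual balls of each colour can be drawn at random within that colour.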

It has the property of minimising a kind of chi-square statistic, $\sum_i\dfrac{\left(p_i-\frac{s_i}{S}\right)^2}{p_i}$, within the constraints of the question, that is, among allocations with $0 \le s_i \le n_i$ and $\sum_i s_i=S$.
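For small cases, the minimising allocation can be found by exhaustive search and compared with the allocation the method produces. The sketch below is only a sanity check, not part of the method: the function names are hypothetical, inputs are plain tuples in a fixed colour order, every $p_i$ is assumed to be strictly positive, and the search grows too quickly with $C$ and $S$ to be useful beyond toy examples.

```python
from itertools import product


def chi_square(targets, counts):
    """Compute sum_i (p_i - s_i / S)^2 / p_i for an allocation (needs p_i > 0)."""
    total = sum(counts)
    return sum((p - s / total) ** 2 / p for p, s in zip(targets, counts))


def best_feasible_by_brute_force(targets, available, sample_size):
    """Search every allocation with 0 <= s_i <= n_i and sum_i s_i = S."""
    ranges = [range(min(n, sample_size) + 1) for n in available]
    feasible = (c for c in product(*ranges) if sum(c) == sample_size)
    return min(feasible, key=lambda c: chi_square(targets, c))


# Targets 50% / 30% / 20%, but only 2 balls of the first colour are available.
best = best_feasible_by_brute_force((0.5, 0.3, 0.2), (2, 100, 100), 10)
print(best)  # (2, 5, 3)
```

Working through the quotients by hand for the same inputs also gives $s = (2, 5, 3)$: the first colour is capped at its 2 available balls, and the remaining 8 places are split between the other two colours roughly in the ratio $3:2$.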