Average number of aggregations in the dataset

43 Views Asked by At

I am dealing with numerical codes / sensible data. Part of the data is being blacked because of security reasons.

To simplify the problem, imagine a code of 4 digits [_ _ _ _], e.g. [1 4 3 9] etc., where the first two digits are being blacked, such that they can't be recovered anymore.

Example: I am seeing the code [* * 5 4], that might represent any of the 100 different possibilities, i.e. [1 2 5 4], [7 4 5 4], [2 2 5 4] etc. The blackening introduces aggregations into the data, e.g. the 2 codes [7 4 5 4], [2 2 5 4] are being aggregated into one [* * 5 4].

Assuming a random distribution of the digits 0-9 over all 4 places and having a set of k = 60 distinct samples of the form [**__], I would like to tell the average number of aggregations in this data.

Edit: As one aggregation I am considering two or more 4-digit codes [_ _ _ _] that have been aggregated into one 2-digit code[* * _ _].

1

There are 1 best solutions below

0
On

The problem is equivalent to the following one:

Let us fill randomly $n$ bins with $m$ balls and ask what is the probability that exactly $k$ bins are filled with at least 2 balls.

The solution of the problem for $k=n$ can be found here, where also a formula for computing the number of ways to partition a set of $m$ objects into $k$ subsets so that each subset contains at least 2 objects (aka 2-associated Stirling number of second kind) is given: $$S_2(m,k)=\sum_i (-1)^i \binom{m}{m-i} { m-i \brace k - i},\tag1$$ where $$ { m \brace k}\equiv S_1(m,k) $$ are the "usual" Stirling number of second kind.

In our problem along with the bins containing at least 2 balls there are additionally $n-k$ bins containing 0 or 1 ball. Taking this into account the overall number of ways to have exactly $k$ bins with 2 or more balls is: $$ N_k(n,m)=n!\sum_{l=0}^{n-k}\binom ml\frac{S_2(m-l,k)}{(n-k-l)!}\tag2 $$ where the factor $\dfrac{n!}{(n-k-l)!}$ counts the number of all possible permutations of the bins with account for the fact that empty bins are indistinguishable, and the factor $\binom ml$ counts the number of ways to choose $l$ balls going to bins filled with one ball.

Finally the probability to have $k$ "aggregations" and the expected value of the "aggregations" are given by $$ p_k(n,m)=\frac{N_k(n,m)}{n^m},\quad \bar{k}(n,m)=\sum_k k p_k(n,m), $$ respectively.

For your example one obtains: $$\bar{k}(100,60)\approx12.1233.$$