Randomic distribution

126 Views Asked by At

Suppose I have X elements [A,B,C,D,...] and Y subsets of X of max length 4.

Suppose the subsets are build up randomly, is there a way to compute the average number of subsets an item of X is an element of ?

EDIT:

I have a file like this: where X are the numbers between 0 and 99
with 100 000 rows, which represent the subsets and I get that singletons are present in approximately 2500 rows
i.e. E(W) = 100000 * 0.025
while according to the formula it should be : E(W) = 1000000 * 0,040

82 73 62 
21 49 11 
28 
54 5 10 
42 
31 35 
89 5 
42 88 9 
62 
15 77 72 83 
33 83 
0 
73 88 
84 10 
61 98 78 
22 13 6 
10 66 
76 34 8 0 
92 39 
52 97 14 
95 7 46 6 
79 30 
74 29 73 49 
23 83 
63 
99 18 68 10 
59 64 8 52 
21 85 

code snippet:

        int j = 0; // row counter
        while(j < 100000)
        {
            List<String> list = new ArrayList<String>();
            int max = r.nextInt(4) + 1; // gives me an int between 1 and 4
            for(int jj = 0;jj <max;jj++) // cycle to generate subset
            {
                int rand;

                // produce random number between 0 and 99
                do
                {   
                 rand = r.nextInt(100);
                }
                while(!notPresent(""+rand,list)); // retry if it's been already inserted in the list  

                list.add(""+rand);
                writer.print(rand+" "); // write set element
            }

     writer.println();  // new line character
     j++;   
    }
1

There are 1 best solutions below

18
On BEST ANSWER

We change notation a little. Let $A$ be an $N$-element set. We choose at random a multiset $C$ of $m$ subsets of $A$, each of cardinality $\le 4$, with all choices equally likely.

Let $a$ be a specific element of $A$, and let random variable $W$ be the number of sets in $C$ that contain $a$. Let $U_i=1$ if on the $i$-th choice of a set, our chosen set contained $a$, and let $U_i=0$ otherwise. Then $W=U_1+\cdots+U_m$. By the linearity of expectation we have $$E(W)=E(U_1)+E(U_2)+\cdots +E(U_m)=mE(U_1).$$

So now all we need is to find $E(U_1)$, that is, the probability that $U_1=1$.

There are $\binom{N}{4}$ possible sets of $4$, $\binom{N}{3}$ sets of $3$, and so on, down to the empty set if you wish, but I will (arbitrarily) not allow that as a choice.

The probability than a $4$-set contains $a$ is $4/N$, and the probability that a $3$-set contains $a$ is $3/N$, and so on. Thus $$E(U_1)=\frac{(4/N)\binom{N}{4}+(3/N)\binom{N}{3}+(2/N)\binom{N}{2}+(1/N)\binom{N}{1}}{\binom{N}{4}+\binom{N}{3}+\binom{N}{2}+\binom{N}{1}}.$$ The numerator simplifies to $\binom{N}{3}+\binom{N}{2}+\binom{N}{1}+1$. Putting things together we find that $$E(W)=m\cdot \frac{ \binom{N}{3}+\binom{N}{2}+\binom{N}{1}+1}{\binom{N}{4}+\binom{N}{3}+\binom{N}{2}+\binom{N}{1}}.$$

Remark: You might wish to change the model in various ways. The basic technique detailed above probably can, with small changes, still be used.

In particular, it may be that for your application you are interested not in subsets, but rather in sequences of length $\le 4$. The basic analysis will be much the same, with small differences of detail, powers instead of binomial coefficients.

Added: If the size of the set is randomly chosen to be between $1$ and $4$, then $E(U_1)$ changes. We use a conditional expectation argument.

If the size is $1$, then the expectation is $\frac{1}{N}$, if the size is $2$ the expectation is $\frac{2}{N}$ and so on, so $$E(U_1)=\frac{10}{4N}.$$