Suppose I have X elements [A,B,C,D,...] and Y subsets of X of max length 4.
Suppose the subsets are build up randomly, is there a way to compute the average number of subsets an item of X is an element of ?
EDIT:
I have a file like this:
where X are the numbers between 0 and 99
with 100 000 rows, which represent the subsets
and I get that singletons are present in approximately 2500 rows
i.e. E(W) = 100000 * 0.025
while according to the formula it should be : E(W) = 1000000 * 0,040
82 73 62
21 49 11
28
54 5 10
42
31 35
89 5
42 88 9
62
15 77 72 83
33 83
0
73 88
84 10
61 98 78
22 13 6
10 66
76 34 8 0
92 39
52 97 14
95 7 46 6
79 30
74 29 73 49
23 83
63
99 18 68 10
59 64 8 52
21 85
code snippet:
int j = 0; // row counter
while(j < 100000)
{
List<String> list = new ArrayList<String>();
int max = r.nextInt(4) + 1; // gives me an int between 1 and 4
for(int jj = 0;jj <max;jj++) // cycle to generate subset
{
int rand;
// produce random number between 0 and 99
do
{
rand = r.nextInt(100);
}
while(!notPresent(""+rand,list)); // retry if it's been already inserted in the list
list.add(""+rand);
writer.print(rand+" "); // write set element
}
writer.println(); // new line character
j++;
}
We change notation a little. Let $A$ be an $N$-element set. We choose at random a multiset $C$ of $m$ subsets of $A$, each of cardinality $\le 4$, with all choices equally likely.
Let $a$ be a specific element of $A$, and let random variable $W$ be the number of sets in $C$ that contain $a$. Let $U_i=1$ if on the $i$-th choice of a set, our chosen set contained $a$, and let $U_i=0$ otherwise. Then $W=U_1+\cdots+U_m$. By the linearity of expectation we have $$E(W)=E(U_1)+E(U_2)+\cdots +E(U_m)=mE(U_1).$$
So now all we need is to find $E(U_1)$, that is, the probability that $U_1=1$.
There are $\binom{N}{4}$ possible sets of $4$, $\binom{N}{3}$ sets of $3$, and so on, down to the empty set if you wish, but I will (arbitrarily) not allow that as a choice.
The probability than a $4$-set contains $a$ is $4/N$, and the probability that a $3$-set contains $a$ is $3/N$, and so on. Thus $$E(U_1)=\frac{(4/N)\binom{N}{4}+(3/N)\binom{N}{3}+(2/N)\binom{N}{2}+(1/N)\binom{N}{1}}{\binom{N}{4}+\binom{N}{3}+\binom{N}{2}+\binom{N}{1}}.$$ The numerator simplifies to $\binom{N}{3}+\binom{N}{2}+\binom{N}{1}+1$. Putting things together we find that $$E(W)=m\cdot \frac{ \binom{N}{3}+\binom{N}{2}+\binom{N}{1}+1}{\binom{N}{4}+\binom{N}{3}+\binom{N}{2}+\binom{N}{1}}.$$
Remark: You might wish to change the model in various ways. The basic technique detailed above probably can, with small changes, still be used.
In particular, it may be that for your application you are interested not in subsets, but rather in sequences of length $\le 4$. The basic analysis will be much the same, with small differences of detail, powers instead of binomial coefficients.
Added: If the size of the set is randomly chosen to be between $1$ and $4$, then $E(U_1)$ changes. We use a conditional expectation argument.
If the size is $1$, then the expectation is $\frac{1}{N}$, if the size is $2$ the expectation is $\frac{2}{N}$ and so on, so $$E(U_1)=\frac{10}{4N}.$$