The formula for the expected value of a hypergeometric distribution is: $$ E(X) = {ns\over N} $$ where
- $n$ is the number of samples (draws without replacement),
- $s$ is the number of successes in the population, and
- $N$ is the population size.
The way we are supposed to see $X$ is: $$ X = I_1 + I_2 + I_3 + \ldots + I_n $$ where each $I_j$ equals $1$ if the $j$-th sample is a success and $0$ otherwise.
Clearly, each $I_j$ is dependent on the previous trials. However, using linearity of expectation, we are supposed to believe that we only need the unconditional probabilities of each trial.
I have two questions:

1. What exactly does it mean for $I_j$ to indicate the success of the $j$-th sample? Are they talking about first taking $n$ samples and then checking the $j$-th sample among them (if so, the entire thing makes sense; see question 2)? Or are they talking about the event of drawing the $j$-th sample after $j-1$ samples have already been taken (if so, taking unconditional probabilities doesn't seem to make sense)?

2. How can we simply take the unconditional probability of $I_j = 1$ to be $s/N$, when each trial has been set up to depend on the previous trials?
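As a numerical sanity check, here is a small simulation with arbitrary illustrative values (not from any particular textbook problem): a population of $N = 20$ objects with $s = 7$ successes, drawing $n = 5$ without replacement. The empirical mean comes out close to $ns/N$ even though the draws are dependent.

```python
import random

# Arbitrary illustrative parameters: population N = 20 with s = 7 successes,
# draw n = 5 objects without replacement.
N, s, n = 20, 7, 5
pool = [1] * s + [0] * (N - s)  # 1 marks a success, 0 a failure

random.seed(0)
trials = 100_000
total = 0
for _ in range(trials):
    total += sum(random.sample(pool, n))  # draw n objects without replacement

print(total / trials)  # empirical mean, close to ns/N
print(n * s / N)       # formula: 1.75
```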
When you say that there are $s$ "good" objects in a pool of $N$ objects from which you are going to draw $n$ objects without replacement, the hypergeometric distribution tells us, before any objects are drawn, the probability that any given number of the "good" objects will end up among the objects we eventually draw. So the hypergeometric distribution is fundamentally about probabilities that are not conditioned on any partial results of drawing some of the $n$ objects from the pool.
I find it helpful to consider each object in the pool as a unique individual that either will or will not be among the $n$ objects drawn from the pool. As a simple concrete example, put $N = 3$ balls labeled A, B, and C in a bag, mix the contents of the bag thoroughly, and then draw all three balls one at a time. Consider drawing ball A to be a "success" and drawing any other ball to be a "failure," so $s = 1$. We will have exactly one "success" in the three one-ball samples, but it is equally likely to occur on the first, second, or third trial, each with probability $s/N = 1/3$.
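A quick simulation of this A/B/C example (a sketch of the setup above, nothing more) confirms that ball A is equally likely to appear in each of the three positions.

```python
import random
from collections import Counter

# Simulate the A/B/C example: draw all three balls and record the position
# at which ball A (the lone "success") appears.
random.seed(1)
trials = 90_000
counts = Counter()
for _ in range(trials):
    order = random.sample(["A", "B", "C"], 3)  # one random draw order
    counts[order.index("A") + 1] += 1          # 1-based position of ball A

for pos in (1, 2, 3):
    print(pos, counts[pos] / trials)  # each close to 1/3
```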
For another concrete example, consider a standard deck of $52$ cards. The deck has been thoroughly shuffled and placed on a table with all cards face down. Suppose that drawing a heart is a "success" while drawing any other card is a "failure".
If we just turn over the top card of the deck, it has probability $1/4$ of being a heart (since there are $s = 13$ hearts in a deck of $N = 52$ cards). If we count nine cards off the top of the deck without revealing their faces and then turn over the tenth card, again it has probability $1/4$ of being a heart. Likewise if we turned over the third card instead of the tenth, or the fifth card instead of any of the others.
Those $1/4$ probabilities are all unconditional probabilities that the card at some position in the deck will be a heart.
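The same point can be checked by simulation: whichever position we pick in advance, the card there is a heart about a quarter of the time.

```python
import random

# Estimate, for several fixed positions in the shuffled deck, the
# unconditional probability that the card at that position is a heart.
random.seed(2)
deck = ["H"] * 13 + ["X"] * 39   # 13 hearts, 39 other cards
trials = 80_000
positions = [1, 3, 5, 10]        # 1-based positions discussed above
hits = {p: 0 for p in positions}
for _ in range(trials):
    shuffled = random.sample(deck, 52)  # one thorough shuffle
    for p in positions:
        hits[p] += shuffled[p - 1] == "H"

for p in positions:
    print(p, hits[p] / trials)  # each close to 1/4
```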
Now if we take the top $n = 20$ cards from the deck, then before we look at any of those cards, each of the $20$ cards has an $s/N = 1/4$ unconditional probability of being a heart. These "success" events are certainly not independent (for example, it is impossible for all $20$ cards to be hearts, since the deck contains only $13$), but their unconditional probabilities are still all equal to $s/N$.
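The dependence is easy to exhibit directly: conditioned on the first card being a heart, the second card is a heart with probability $12/51 \approx 0.235$, even though its unconditional probability is $13/52 = 1/4$. A short simulation of this (just a sketch of the deck example above):

```python
import random

# Show the dependence between the first two cards: the second card's
# unconditional chance of being a heart is 1/4, but conditioned on the
# first card being a heart it drops to 12/51.
random.seed(3)
deck = ["H"] * 13 + ["X"] * 39
trials = 120_000
first_heart = both = second_heart = 0
for _ in range(trials):
    c1, c2 = random.sample(deck, 2)  # top two cards of a fresh shuffle
    second_heart += c2 == "H"
    if c1 == "H":
        first_heart += 1
        both += c2 == "H"

print(second_heart / trials)  # close to 1/4 (unconditional)
print(both / first_heart)     # close to 12/51 (conditional)
```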
The expectation of a hypergeometric distribution, $E(X)$, is not conditioned on any partial set of observations. It's an unconditional expectation. In order to compute that expected value as a sum of expected values via linearity of expectation, we need all of those expected values to be unconditional.
The part that I think is hardest to understand intuitively is that when we apply linearity of expectation, it does not matter whether the individual "success" events are independent. It only matters that we know the unconditional probability of a "success" on each trial, and therefore the expected value of $I_k$ for each $k$.
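Written out, the whole computation is just:

$$ E(X) = E\!\left(\sum_{k=1}^{n} I_k\right) = \sum_{k=1}^{n} E(I_k) = \sum_{k=1}^{n} P(I_k = 1) = \sum_{k=1}^{n} \frac{s}{N} = \frac{ns}{N}. $$

The second equality is linearity of expectation, which requires no independence at all; the third uses the fact that the expectation of an indicator variable is the probability of its event.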
So if you understand linearity of expectation, the computation of $E(X)$ is simple. If you do not truly understand linearity of expectation, then you may need to work on that understanding in order to make sense of $E(X)$.