What statistical or mathematical approaches could be used to address the question below?
Assume we have 40 million records and X of them may contain a specific word (e.g., "dance"). We randomly select 25 million records from the 40 million (random sampling without replacement). What is the probability that some or all of the X records appear in the new population? Or: how small would X become in the smaller group of 25 million? (What proportion of X would appear in the new population n?)
Let the number of records be $N$, of which $X$ contain the keyword "dance". The probability that a randomly chosen record contains the keyword is $\approx\frac X N$ on each draw. If we randomly select $M$ records, we can approximately model the result with a binomial distribution with parameters $M$ and $\frac X N$. So the probability that exactly $n$ of the selected records contain the keyword is
$$\approx{M\choose n}\left(\frac X N\right)^n\left(1-\frac X N\right)^{M-n}.$$
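Since the sampling is without replacement, the exact count follows a hypergeometric distribution, and the binomial formula above is an approximation to it. That approximation is usually considered good only when $M$ is small relative to $N$, which is not the case here. As a sketch in the same notation, the exact probability of exactly $n$ key records (for $\max(0, M-(N-X)) \le n \le \min(X, M)$) would be
$$\frac{{X\choose n}{{N-X}\choose{M-n}}}{{N\choose M}}.$$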
The probability that we get all of the key records, that is $n = X$, is
$$\approx {M\choose X}\left(\frac X N\right)^X\left(1-\frac X N\right)^{M-X}=$$ $$={{25\ 000\ 000}\choose X}\left(\frac X {40\ 000\ 000}\right)^X\left(1-\frac X {40\ 000\ 000}\right)^{25\ 000\ 000-X}.$$
If $X=833492$ then the probability sought is
$$\approx{{25\ 000\ 000}\choose 833492}\left(\frac {833492} {40\ 000\ 000}\right)^{833492}\left(1-\frac {833492}{40\ 000\ 000}\right)^{25\ 000\ 000-833492},$$ which is roughly $10^{-35263}$ (estimated with Stirling's formula), quite small.
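Probabilities this small underflow ordinary floating point, so a numerical check has to work with logarithms. The following is a minimal sketch, not the method used to obtain the figure above. The helper names `log_comb`, `log10_binom_pmf` and `log10_hypergeom_pmf` are hypothetical, and the values $N = 40\ 000\ 000$, $M = 25\ 000\ 000$, $X = 833492$ are taken from the question. It evaluates the binomial expression at $n = X$ and, for comparison, the exact without-replacement (hypergeometric) probability that all $X$ key records are selected.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log10_binom_pmf(n, M, p):
    """log10 of the binomial probability of exactly n successes in M trials."""
    ln_p = log_comb(M, n) + n * math.log(p) + (M - n) * math.log1p(-p)
    return ln_p / math.log(10)

def log10_hypergeom_pmf(n, N, X, M):
    """log10 of the hypergeometric probability of exactly n key records
    when M records are drawn without replacement from N, X of them key."""
    ln_p = log_comb(X, n) + log_comb(N - X, M - n) - log_comb(N, M)
    return ln_p / math.log(10)

# Values assumed from the question.
N = 40_000_000   # total records
M = 25_000_000   # records sampled without replacement
X = 833_492      # records containing the keyword

log10_binom = log10_binom_pmf(X, M, X / N)      # binomial approximation, n = X
log10_exact = log10_hypergeom_pmf(X, N, X, M)   # exact probability of all X

print(f"binomial approximation: about 10^{log10_binom:.1f}")
print(f"exact (hypergeometric): about 10^{log10_exact:.1f}")
```

In this far tail the two models give very different exponents, so the binomial value should be read only as an indication that the event is practically impossible, not as an accurate figure.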
Alternatively, by common sense, the expected number of key records among the $25\ 000\ 000$ sampled records is $$\frac X{40\ 000\ 000}\times 25\ 000\ 000 = 0.625\,X.$$
If $X=833492$ then this expected number is
$$\frac {833492}{40000000}\times{25000000}=520932.5$$
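To get a feel for how far the actual count is likely to stray from this expected value, one can also compute the standard deviation. The sketch below assumes the exact hypergeometric model for sampling without replacement, whose variance is $M\frac X N\left(1-\frac X N\right)\frac{N-M}{N-1}$. The function names `expected_key_records` and `sd_key_records` are hypothetical.

```python
import math

def expected_key_records(N, X, M):
    """Expected number of key records among M records drawn from N."""
    return M * X / N

def sd_key_records(N, X, M):
    """Standard deviation of that count under the hypergeometric model
    (sampling without replacement), including the finite-population factor."""
    p = X / N
    variance = M * p * (1 - p) * (N - M) / (N - 1)
    return math.sqrt(variance)

# Values assumed from the question.
N, M, X = 40_000_000, 25_000_000, 833_492

mean = expected_key_records(N, X, M)
sd = sd_key_records(N, X, M)

print(f"expected key records: {mean}")
print(f"standard deviation:   {sd:.1f}")
print(f"expected proportion of X retained: {mean / X}")
```

Under these assumptions the count is concentrated within a few standard deviations of the mean, so the retained share of $X$ stays very close to $M/N$.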