Genomes are sequences over four nucleobases (A, C, G, T). If we take genome substrings of length $k$ (called $k$-mers), there are $4^k$ theoretically possible substrings. A genome of length $l$ (meaning $l$ nucleobases) contains a total of $l - (k - 1)$ such substrings, counting overlaps. If we now randomly draw $l - (k - 1)$ substrings from these $4^k$ theoretically possible substrings, what are the expected values of:
- the number of unique substrings (occur exactly once)
- the number of distinct substrings (occur once or more)
- the number of recurring substrings (occur at least twice)
To simplify the question: if you have $n$ numbers and draw $m$ of them at random ($n>m$; a number can be drawn several times), what is the expected number of values among these $m$ draws that occur exactly once, once or more, or at least twice?
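To make the three quantities concrete, here is a minimal simulation sketch. It assumes the $m$ draws are independent and uniform over the $n$ values, which real genomes do not satisfy. The function names (`count_categories`, `simulate`, `closed_form`) are hypothetical and chosen only for illustration. Under that same assumption, linearity of expectation gives $m(1 - 1/n)^{m-1}$ for values occurring exactly once and $n\left(1 - (1 - 1/n)^m\right)$ for values occurring once or more; the difference of the two is the count occurring at least twice. The code compares these against the simulated averages.

```python
import random
from collections import Counter


def count_categories(draws):
    """Return (unique, distinct, recurring) for a sequence of drawn values."""
    counts = Counter(draws)
    unique = sum(1 for c in counts.values() if c == 1)  # occur exactly once
    distinct = len(counts)                              # occur once or more
    recurring = distinct - unique                       # occur at least twice
    return unique, distinct, recurring


def simulate(n, m, trials=2000, seed=0):
    """Average the three counts over repeated uniform draws with replacement.

    Assumption: each of the m draws is independent and uniform over n values.
    """
    rng = random.Random(seed)
    totals = [0, 0, 0]
    for _ in range(trials):
        draws = [rng.randrange(n) for _ in range(m)]
        for i, value in enumerate(count_categories(draws)):
            totals[i] += value
    return tuple(t / trials for t in totals)


def closed_form(n, m):
    """Expected (unique, distinct, recurring) under the same uniform assumption."""
    unique = m * (1 - 1 / n) ** (m - 1)
    distinct = n * (1 - (1 - 1 / n) ** m)
    return unique, distinct, distinct - unique


if __name__ == "__main__":
    n, m = 4 ** 5, 500  # e.g. k = 5 and 500 sampled k-mers (illustrative values)
    print("simulated:  ", simulate(n, m))
    print("closed form:", closed_form(n, m))
```

For the genome setting, one would set `n = 4 ** k` and `m = l - (k - 1)`; the illustrative values above are not taken from any real data set.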
Thank you!