Expected Value for distinct substrings of length $k$


Genomes consist of 4 nucleobases (A, C, G, T). If we take genome substrings of length $k$ (called $k$-mers), there are $4^k$ theoretically possible substrings. A genome of length $l$ (meaning $l$ nucleobases) contains a total of $l - (k - 1)$ substrings of length $k$. If we now randomly draw $l - (k - 1)$ substrings from these $4^k$ theoretically possible substrings, what is the expected value of:

  • the number of unique substrings (occur exactly once)
  • the number of distinct substrings (occur once or more)
  • the number of recurring substrings (occur at least twice)

To simplify the question: if you have $n$ numbers and draw $m$ numbers from them ($n>m$; a number can be drawn several times), what is the expected number of distinct values among the $m$ draws that occur exactly once / once or more / at least twice?
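The simplified version can be made concrete with a small sketch. It assumes the $m$ draws are independent and uniform over the $n$ values, which is how the question is phrased; overlapping $k$-mers in a real genome are not independent, so this is only a model. Under that assumption, linearity of expectation gives $m(1-1/n)^{m-1}$ for values occurring exactly once and $n\,(1-(1-1/n)^m)$ for values occurring at least once; the recurring count is the difference. The function names `expected_counts` and `simulate` are made up for illustration.

```python
import random
from collections import Counter


def expected_counts(n, m):
    """Closed-form expectations for m independent uniform draws from n values.

    Returns (unique, distinct, recurring):
      unique    = expected number of values drawn exactly once
      distinct  = expected number of values drawn at least once
      recurring = expected number of values drawn at least twice
    """
    q = 1.0 - 1.0 / n                # probability a single draw misses a given value
    unique = m * q ** (m - 1)        # n * [m * (1/n) * q^(m-1)]
    distinct = n * (1.0 - q ** m)    # n * P(value drawn at least once)
    recurring = distinct - unique
    return unique, distinct, recurring


def simulate(n, m, trials=2000, seed=0):
    """Monte Carlo estimate of the same three quantities (sanity check only)."""
    rng = random.Random(seed)
    total_unique = 0
    total_distinct = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n) for _ in range(m))
        total_unique += sum(1 for c in counts.values() if c == 1)
        total_distinct += len(counts)
    unique = total_unique / trials
    distinct = total_distinct / trials
    return unique, distinct, distinct - unique


if __name__ == "__main__":
    # For the genome setting one would take n = 4**k and m = l - (k - 1).
    print(expected_counts(4, 2))
    print(simulate(4, 2))
```

For $n = 4$, $m = 2$ the closed form gives 1.5 unique, 1.75 distinct and 0.25 recurring values, which can be checked by hand: the two draws coincide with probability $1/4$.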

Thank you!