Expected Value for distinct substrings of length $k$

28 Views Asked by Bumbble Comm At 27 Mar 2026 - 5:04

Genomes consist of 4 nucleobases (A, C, G, T). If we take genome substrings of length $k$ (called $k$-mers), there are $4^k$ theoretically possible substrings. A genome of length $l$ (meaning $l$ nucleobases) contains a total of $l - (k - 1)$ substrings. If we now randomly draw $l - (k - 1)$ substrings from these $4^k$ theoretically possible substrings, what is the expected value of each of the following?

the number of unique substrings (occur exactly once)
the number of distinct substrings (occur once or more)
the number of recurring substrings (occur at least twice)

To simplify the question: if you have $n$ numbers and draw $m$ numbers from them ($n > m$; a number can be drawn more than once), what is the expected number of values among the $m$ draws that occur exactly once, once or more, or at least twice?
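
One way to approach the simplified version is sketched below, under the assumption (not stated in the question) that the $m$ draws are independent and uniform over the $n$ values. By linearity of expectation, the expected number of distinct values is then $n\left(1 - (1 - 1/n)^m\right)$, the expected number occurring exactly once is $m\,(1 - 1/n)^{m-1}$, and the expected number occurring at least twice is the difference of the two. Real $k$-mers overlap and genomes are not uniform, so this is only an idealised model. The function names `expected_counts` and `simulate_counts` are made up for illustration, and the simulation is only a sanity check of the formulas.

```python
import random
from collections import Counter


def expected_counts(n, m):
    """Closed-form expectations for m independent uniform draws
    (with replacement) from n possible values.

    Returns (unique, distinct, recurring):
      unique    -- values drawn exactly once
      distinct  -- values drawn once or more
      recurring -- values drawn at least twice
    """
    p_other = 1 - 1 / n                  # chance a single draw misses a given value
    distinct = n * (1 - p_other ** m)    # each value appears with prob 1 - (1 - 1/n)^m
    unique = m * p_other ** (m - 1)      # n * [m * (1/n) * (1 - 1/n)^(m-1)]
    recurring = distinct - unique
    return unique, distinct, recurring


def simulate_counts(n, m, trials=2000, seed=0):
    """Monte Carlo estimate of the same three quantities."""
    rng = random.Random(seed)
    total_unique = 0
    total_distinct = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n) for _ in range(m))
        total_distinct += len(counts)
        total_unique += sum(1 for c in counts.values() if c == 1)
    unique = total_unique / trials
    distinct = total_distinct / trials
    return unique, distinct, distinct - unique


if __name__ == "__main__":
    # Hypothetical example values: k-mers of length 8 in a genome of length 1000.
    k, l = 8, 1000
    n = 4 ** k           # theoretically possible k-mers
    m = l - (k - 1)      # number of k-mers in the genome
    print("closed form:", expected_counts(n, m))
    print("simulation: ", simulate_counts(n, m, trials=200))
```

For the original question, set $n = 4^k$ and $m = l - (k - 1)$, as in the example at the bottom of the block.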

Thank you!
