Measuring the similarity of two subsets of $\mathbb{N}$ with the same upper bound


The context is comparing two features in a DNA sequence (but the solution does not require any understanding of DNA, its features, or biology in general).

For example, if a DNA sequence consists of 100 nucleotides (whatever a nucleotide might be), feature A might cover nucleotides [2-7, 17-32, 50-52, 54-57, 60-90] and feature B might cover [10-15, 49-56, 76-98]. Now, we can take each feature as simply a subset of [1-100], and forget that we are talking about DNA.
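To make the reduction to plain subsets concrete, here is a minimal Python sketch that expands the interval lists from the example into sets of positions. The helper name `expand` and the variable names are my own, and I am assuming the ranges are inclusive at both ends.

```python
def expand(intervals):
    """Turn inclusive (start, end) pairs into a set of integer positions."""
    positions = set()
    for start, end in intervals:
        # range() excludes its stop value, so add 1 to keep `end` in the set
        positions.update(range(start, end + 1))
    return positions


# The two features from the example, as subsets of {1, ..., 100}
feature_a = expand([(2, 7), (17, 32), (50, 52), (54, 57), (60, 90)])
feature_b = expand([(10, 15), (49, 56), (76, 98)])

# Sizes of A, B and their overlap
print(len(feature_a), len(feature_b), len(feature_a & feature_b))  # 60 37 21
```

With this representation, any candidate measure only needs the sizes of the two sets, the size of their intersection, and the common upper bound.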

Finally, by similarity we mean any measure that reveals some correlation between the features, perhaps hinting at a causal relationship. The measure should tend to 0 when we take random subsets, and attain its maximum when the subsets are identical.
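To illustrate what such a measure could look like, here is a sketch of one candidate that I believe has both properties: the phi coefficient, i.e. the Pearson correlation of the 0/1 indicator vectors of the two subsets. This is only an assumption about what might fit, not necessarily the measure being asked for. The function name `phi` and the parameter `n` (the common upper bound) are my own. It equals 1 for identical subsets, is close to 0 on average for independently chosen random subsets, and is undefined when either subset is empty or covers all of $\{1, \dots, n\}$.

```python
from math import sqrt


def phi(a, b, n):
    """Phi coefficient of two subsets a, b of {1, ..., n}."""
    n11 = len(a & b)              # positions in both subsets
    n10 = len(a - b)              # positions only in a
    n01 = len(b - a)              # positions only in b
    n00 = n - n11 - n10 - n01     # positions in neither subset
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Happens when a or b is empty or equal to the whole range
        raise ValueError("phi is undefined for an empty or full subset")
    return (n11 * n00 - n10 * n01) / denom


lower_half = set(range(1, 51))
upper_half = set(range(51, 101))

print(phi(lower_half, lower_half, 100))  # 1.0 (identical subsets)
print(phi(lower_half, upper_half, 100))  # -1.0 (complementary subsets)
```

Note that this candidate can also go negative, down to -1 when one subset is the complement of the other, which may or may not be desirable depending on whether avoidance between features should count as a signal.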