I have a large number (~1.5 million) of protein sequences of different lengths. There are 6 schematic examples in the attached image.
Within each of these sequences, there are >= 0 domains (shown as colored boxes). Each domain is a match to one of a collection of approximately 15,000 profile HMMs (D1 ... D15000). Domains cannot overlap. The sum of the lengths of all domains identified in a protein is <= the length of the entire protein. Some proteins may not contain any D-type domain.
Apart from these 15K 'D type domains', I have also detected matches to a different type of region in each protein; let's call it 'IDR type'. These IDRs are shown as red lines above the protein sequences. Some proteins may not possess an IDR type region in them.
The IDRs are of different lengths, but the sum of their lengths in one protein cannot be longer than the length of the protein. The location of these IDRs can be anywhere between the start and end of the protein, and the length of each individual IDR is also quite variable. Two IDRs will not overlap with each other.
Now I want to know how to formulate and statistically test my hypothesis that:
1. the IDR(s) in each protein (across the entire collection of proteins) do NOT overlap AT ALL with domain D-type regions (i.e. red lines and colored boxes never overlap),
and / or
2. the IDR(s) in each protein (across the entire collection of proteins) do NOT significantly intersect with domain D-type regions (i.e. red lines and colored boxes rarely overlap). In other words, is the frequency of intersection between D-type regions and IDR-type regions significantly lower or higher than expected by chance, and what is the statistical support for that assertion?
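To make "expected just by chance" concrete, here is a minimal sketch of one possible null model for a single protein. It is only an illustration, not an established method for this data. It assumes 1-based inclusive start/stop coordinates, and it randomizes the IDR positions by a circular shift along the protein, which keeps IDR lengths and spacing but can wrap an IDR around the sequence ends. The function names and the `n_perm` and `seed` parameters are hypothetical.

```python
import random

def coverage_mask(length, intervals):
    """Boolean per-residue mask; intervals are assumed 1-based, inclusive."""
    mask = [False] * length
    for start, stop in intervals:
        for pos in range(start - 1, stop):
            mask[pos] = True
    return mask

def shift_test(length, domains, idrs, n_perm=1000, seed=0):
    """Compare observed domain/IDR overlap with circularly shifted IDRs."""
    rng = random.Random(seed)
    dom = coverage_mask(length, domains)
    idr = coverage_mask(length, idrs)
    observed = sum(d and r for d, r in zip(dom, idr))
    null = []
    for _ in range(n_perm):
        k = rng.randrange(length)       # random rotation offset
        shifted = idr[k:] + idr[:k]     # rotate the IDR mask by k residues
        null.append(sum(d and r for d, r in zip(dom, shifted)))
    expected = sum(null) / n_perm
    # Empirical one-sided p-values, with the usual +1 correction.
    p_lower = (1 + sum(x <= observed for x in null)) / (n_perm + 1)
    p_higher = (1 + sum(x >= observed for x in null)) / (n_perm + 1)
    return observed, expected, p_lower, p_higher

# Toy protein: one domain on residues 1-50, one IDR on residues 51-100.
obs, exp, p_lo, p_hi = shift_test(100, [(1, 50)], [(51, 100)])
print(obs, round(exp, 1), p_lo, p_hi)
```

Whether to test each protein separately like this, or to pool observed and shuffled overlaps across the whole collection into one statistic, is exactly the part I am unsure about.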
For each of the proteins in question, I know the start/stop locations of each of the D type matches and of the IDR type regions. But I have not processed this information to know, even casually, what the reality of these overlaps is.
In the attached image, which I made up with imaginary data, the positions of the IDRs (red lines) are completely non-intersecting with the domains (colored boxes). This is not necessarily the case with the real data I have.
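As a starting point, the observed overlap could be computed from the coordinates along these lines. This is a minimal sketch that assumes each protein's regions are available as lists of (start, stop) tuples with 1-based inclusive coordinates; the function name `overlap_residues` is hypothetical.

```python
def overlap_residues(domains, idrs):
    """Count residues covered by both a domain and an IDR in one protein.

    Both arguments are lists of (start, stop) tuples, assumed 1-based and
    inclusive, with no overlaps within either list.
    """
    domains = sorted(domains)
    idrs = sorted(idrs)
    i = j = 0
    shared = 0
    while i < len(domains) and j < len(idrs):
        d_start, d_stop = domains[i]
        r_start, r_stop = idrs[j]
        lo = max(d_start, r_start)
        hi = min(d_stop, r_stop)
        if lo <= hi:                    # the two intervals intersect
            shared += hi - lo + 1
        # Advance whichever interval ends first.
        if d_stop < r_stop:
            i += 1
        else:
            j += 1
    return shared

# Toy example: two domains, two IDRs, 12 shared residues in total.
print(overlap_residues([(10, 50), (80, 120)], [(1, 9), (45, 85)]))  # 12
```

Summing this over all proteins (and comparing it with the total domain and IDR coverage) would at least tell me how far the real data is from the "no overlap at all" picture.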
Please note that the position / location / coordinates of any "domain of D type" or "region of IDR type" is random.
Could the answer be something like what is explained at http://www.mathpages.com/home/kmath580/kmath580.htm? Or perhaps I am overthinking this?
Also, I suppose because I am testing the hypothesis on multiple proteins, there needs to be some sort of multiple testing correction?
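If a p-value were computed per protein, one common option I am aware of is the Benjamini-Hochberg false discovery rate adjustment. The sketch below is only my understanding of that procedure, written out to make the question concrete; I do not know whether it is the appropriate correction here, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy example with four per-protein p-values.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```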
Thank you in advance for your help.