I am trying to work out what the probability that $n$ coin toss sequences of length $k$ match in $m$ spots. The simplest example would be two sequences of a given length. If I toss a coin 100 times, twice, how many of positions in the sequence will match? My intuition tells me that if I have a fair coin (so $p = 0.5$) then I should expect two random sequences to on average, agree in 50% of their locations. I also think that I am correct in saying that the probability of two sequences of length $k$ being in complete agreement is $p^q (1-p)^{(k-q)}$ where $q$ is the number of heads in the sequence.
However, I am unable to formalize my particular question (which is a fraction of the sequence, independent of subsequence length), nor am I able to extend this to the probability that 3 sequences of a particular length agreeing in 50% of their locations. Any help would be appreciated.
Edit: To be more specific, I am having difficulty with the combinatorics involved when the probability $p \neq 0.5$. Suppose the first sequence contains 2 heads. It can be $HHTT$ or $HTHT$ or $THTH$ etc.. it doesn't matter. Now I want to calculate the probability that another sequence matches $0, 1, 2, 3, \mathrm{or}\ 4$ of those outcomes. I have chosen $HHTT$ as the first sequence and enumerated the possible matching sequence for each case.
In the case of 0 matching outcomes, the only possibility is
\begin{array}{cccc} & T & T & H & H \\ \end{array}
In the case of 1 matching outcome, the possibilities are:
\begin{array}{cccc} & H & T & H & H \\ & T & H & H & H \\ & T & T & T & H \\ & T & T & H & T \\ \end{array}
In the case of 2 matching outcomes, the possibilities are:
\begin{array}{cccc} & H & H & H & H \\ & H & T & T & H \\ & H & T & H & T \\ & T & H & T & H \\ & T & H & H & T \\ & T & T & T & T \\ \end{array}
In the case of 3 matching outcomes, the possibilities are:
\begin{array}{cccc} & H & H & T & H \\ & H & T & T & T \\ & H & H & H & T \\ & T & H & T & T \\ \end{array}
Finally, for all 4 matching outcomes, the only possibility is:
\begin{array}{cccc} & H & H & T & T \\ \end{array}
Now the number of combinations for matching $m$ outcomes for a sequence of length $k$ is given by ${k\choose m}$. The only problem is, because the probability is not $0.5$, I need to enumerate how many of those contain $x$ number of $H$, so I can sum up $p^q (1-p)^{(k-q)}$, each multiplied by the appropriate prefactor. However, I am unable to identify the appropriate pattern. This is the main difficulty I am having.
Any help towards figuring out a formula for the probability of two (or more!) sequences of length $k$ matching in $m$ spots, where the probability of each outcome is not 0.5 would be greatly appreciated.
Thanks!
We are going to imagine that the set of all our rolls form a grid.
So if I rolled HTTH, TTTH, HHHH. The grid created would be:
Now when we are looking for match we can just look at each column of our grid individually. If every element in the column is the same we have a match.
Now we can rephrase our problem in terms of our grid
We want to calculate the probability of getting:
We can know view our problem as a simple case of binomial probability with each of our columns being a trial.
There are $\binom{k}{m}$ ways we can select the $m$ columns we want as our matching columns.
The probability of getting a matching column is $p^n+(1-p)^n$ since we can have all heads or all tails in our column.
And of course the probability of not having a match in a column is $1-(p^n+(1-p)^n)$
This leaves us with
$$\binom{k}{m}(p^n+(1-p)^n)^m(1-p^n-(1-p)^n)^{k-m}$$
As our final probability.
I also wrote a program in java to verify this and the formula seems to be correct you can look at the code yourself here.