Probability that n sequences agree in a certain percentage of them

180 Views Asked by At

I am trying to work out what the probability that $n$ coin toss sequences of length $k$ match in $m$ spots. The simplest example would be two sequences of a given length. If I toss a coin 100 times, twice, how many of positions in the sequence will match? My intuition tells me that if I have a fair coin (so $p = 0.5$) then I should expect two random sequences to on average, agree in 50% of their locations. I also think that I am correct in saying that the probability of two sequences of length $k$ being in complete agreement is $p^q (1-p)^{(k-q)}$ where $q$ is the number of heads in the sequence.

However, I am unable to formalize my particular question (which is a fraction of the sequence, independent of subsequence length), nor am I able to extend this to the probability that 3 sequences of a particular length agreeing in 50% of their locations. Any help would be appreciated.

Edit: To be more specific, I am having difficulty with the combinatorics involved when the probability $p \neq 0.5$. Suppose the first sequence contains 2 heads. It can be $HHTT$ or $HTHT$ or $THTH$ etc.. it doesn't matter. Now I want to calculate the probability that another sequence matches $0, 1, 2, 3, \mathrm{or}\ 4$ of those outcomes. I have chosen $HHTT$ as the first sequence and enumerated the possible matching sequence for each case.

In the case of 0 matching outcomes, the only possibility is

\begin{array}{cccc} & T & T & H & H \\ \end{array}

In the case of 1 matching outcome, the possibilities are:

\begin{array}{cccc} & H & T & H & H \\ & T & H & H & H \\ & T & T & T & H \\ & T & T & H & T \\ \end{array}

In the case of 2 matching outcomes, the possibilities are:

\begin{array}{cccc} & H & H & H & H \\ & H & T & T & H \\ & H & T & H & T \\ & T & H & T & H \\ & T & H & H & T \\ & T & T & T & T \\ \end{array}

In the case of 3 matching outcomes, the possibilities are:

\begin{array}{cccc} & H & H & T & H \\ & H & T & T & T \\ & H & H & H & T \\ & T & H & T & T \\ \end{array}

Finally, for all 4 matching outcomes, the only possibility is:

\begin{array}{cccc} & H & H & T & T \\ \end{array}

Now the number of combinations for matching $m$ outcomes for a sequence of length $k$ is given by ${k\choose m}$. The only problem is, because the probability is not $0.5$, I need to enumerate how many of those contain $x$ number of $H$, so I can sum up $p^q (1-p)^{(k-q)}$, each multiplied by the appropriate prefactor. However, I am unable to identify the appropriate pattern. This is the main difficulty I am having.

Any help towards figuring out a formula for the probability of two (or more!) sequences of length $k$ matching in $m$ spots, where the probability of each outcome is not 0.5 would be greatly appreciated.

Thanks!

2

There are 2 best solutions below

0
On BEST ANSWER

We are going to imagine that the set of all our rolls form a grid.

So if I rolled HTTH, TTTH, HHHH. The grid created would be:

HTTH  
TTTH
HHHH

Now when we are looking for match we can just look at each column of our grid individually. If every element in the column is the same we have a match.

Now we can rephrase our problem in terms of our grid

We want to calculate the probability of getting:

  • $m$ matching columns
  • In a grid with $k$ columns and $n$ rows
  • With $p$ being the probability any element is a heads

We can know view our problem as a simple case of binomial probability with each of our columns being a trial.

There are $\binom{k}{m}$ ways we can select the $m$ columns we want as our matching columns.

The probability of getting a matching column is $p^n+(1-p)^n$ since we can have all heads or all tails in our column.

And of course the probability of not having a match in a column is $1-(p^n+(1-p)^n)$

This leaves us with

$$\binom{k}{m}(p^n+(1-p)^n)^m(1-p^n-(1-p)^n)^{k-m}$$

As our final probability.

I also wrote a program in java to verify this and the formula seems to be correct you can look at the code yourself here.

6
On

When you are matching two sequences of a fair coin it doesn't matter what the first one is. It might as well all be heads. Each toss matches with chance $0.5$ The chance of perfect agreement is just $\frac 1{2^n}$ where $n$ is the total number of tosses. Similarly, the chance of matching in $m$ out of $n$ positions is ${n \choose m}\frac 1{2^n}$. You are correct that this is maximized at $50\%$ matches. For reasonably large $n$ the distribution of matches is approximately normal with mean $0.5n$ and standard deviation $\sqrt {\frac n4}$

When you talks of three sequences agreeing, assuming you mean they all have a common value at a given position, the chance is $25\%$. The first one can be anything and the others each have $50\%$ chance to match it. You would therefore expect triple matches in $\frac 14$ of the locations.