Probability of finding a sequence of integers in two-dimensional space

87 Views Asked by At

Suppose I have a matrix, size $m\times m$ for any odd value of $m \geq 3 $, filled with randomly selected integers where every integer is limited to between 1 and $c$, for $c \geq 2$. What is the probability I will find a specific sequence of integers, of length $k$, starting from the center? Furthermore, let us say that $m/2> k$ to avoid having to deal with boundaries. The only additional constraint is that a sequence cannot use the same matrix element more than once, for consider the following matrix:

$ M= \left[ {\begin{array}{ccccccc} 1 & 1 & 1 & 1 & 1 & 1 & 1\\ 1 & 1 & 2 & \color{pink}3 & 1 & 2 & 1\\ 1& \color{pink}3 & \color{blue}4 & \color{blue}4 & 2 & 2& 1 \\ 1 &1 & 2 & \color{red}3 & \color{pink}3 & 3 & 1\\ 1 &4 & 3 & 2 & 1 & 4 & 1\\ 1 & 2 & 1 & 4 & 2 & 1 & 1\\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \end{array} } \right]. $

Let us say that in $M$ we are starting from the center (R4,C4) = $\color{red} 3$, looking for the sequence $\color{red}3,\color{blue}4,\color{pink}3$. In this case there are a total of 4 possible sequences, because we cannot count the center more than once for a given sequence.

I first found the probability for the 1D-case. In the 1D case, the probability is given by $P( c, k) = 1 - (1 - \frac{1}{c^K})$. This is only starting from one position (not over all possible positions).

For the 2D case, I will start at the center of the matrix. For a $3\times3$ matrix, the center is (R2, C2). Extending outward from the center in one direction (either S, SW, W, NW, N, NE, E, or SE) is equivalent to the 1D case, so we can obtain the probability by including a factor of 8 to account for a total of all the possible directions. So, for the case of my $3\times 3$ matrix, I think the probability is given by: $P(c, 2) = 8\cdot(1- (1-\frac{1}{c^2}) )$.

Now I am struggling to extend this out for a matrix that is sized $5\times 5$ for a sequence that has a length $k=3$. Using a Monte Carlo simulation with $10^9$ trials, I calculate the $P(c=12, k=3) \approx 0.066805$, but using $8\cdot(1- (1-\frac{1}{12^3}) )=0.0046$. I suspect that I will need to find an additional factor that depends on how the number of possible pathways grows for larger sequences. My intuition tells me that that should scale like $8, 16-1, 32-2, ...$, but I am stuck.

1

There are 1 best solutions below

0
On BEST ANSWER

Intuitively your idea for the probability in the $3 \times 3$ matrix might sound right, you essentially have 8 one-dimensional paths so it feels like you can multiply the probability by 8 to have the new probability. However this is incorrect and is one of the reasons that makes probability difficult, it doesn't always follow intuition.

To see an example of why your formula is wrong, consider $P(c=2,k=2)$. According to your calculation, the probability is $8\left( 1 - \left(1 - \frac{1}{2^2} \right) \right) = 2$. Probabilities can't be more than one, but your formula returned $2$! Consider another case, $P(c=4,k=2$), your computed probability is $.5$, which interpreted is: 50% of the time I generate a random $3 \times 3$ matrix, it will have the sequence that you desire given your constraint of starting in the middle. However, this can't possibly be true because the center node only has a $\frac{1}{c}$ chance of being the correct value, so the probability should not be larger than $\frac{1}{4}$.

The question I feel like you are asking is: Given a $3\times3$ matrix, what is the probability of finding At least one correct path. You don't particularly care if there are $2,3$ or $4$ paths, you just want existence. This means that you should actually be summing up all of the probabilities:

$$ P(\text{1 path exists}) + P(\text{2 paths exist}) + \cdots + P(\text{8 paths exist}) $$ in order to find the actual probability. Fortunately we can take advantage of the complement which says that this probability is equivalent to $1 - P(\text{0 paths exist}).$

There are only two ways that our path isn't going to exist when $k=2$:

  1. The center node is incorrect, which happens with probability $1 - \frac{1}{c}$.
  2. The center node is correct, and all of the 8 nodes surrounding the center are incorrect. This happens with probability $\left(\frac{1}{c} \right) \left( 1 - \frac{1}{c}\right)^8$.

This means that $P(\text{0 paths exist}) = \left[1 - \frac{1}{c}\right] + \left(\frac{1}{c} \right) \left( 1 - \frac{1}{c}\right)^8$, so we know that $$ P(\text{at least 1 path exists}) = 1 - \left[1 - \frac{1}{c}\right] -\left(\frac{1}{c} \right) \left( 1 - \frac{1}{c}\right)^8 \\ = \frac{1}{c} - \left(\frac{1}{c} \right) \left( 1 - \frac{1}{c}\right)^8 \\ = \frac{1}{c}\left(1 - \left(1 - \frac{1}{c} \right)^8 \right). $$

So the formula for the $3\times3$ matrix is $P(c,k) = \frac{1}{c}\left(1 - \left(1 - \frac{1}{c} \right)^8 \right)$.

Another derivation, and perhaps one that you will feel more comfortable with is by treating the 8 cells around the center as a Binomial distribution that has $n$ draws with $\frac{1}{c}$ as its probability, and the equation which is the probability of the center node being correct, multiplied by the probability of at least one success.

Unfortunately I haven't been able to extend this case to the $5\times5$ matrix because I see no obvious extension for how many paths we have available at the third element of the sequence.

  • At step 1, we are guaranteed to only have one node to consider, the center.
  • At step 2, we are guaranteed to only have 8 nodes to consider, each element surrounding the center.
  • At step 3, we have $Y$ nodes, which is dependent on both the number of nodes at step 2 that were correct, as well as WHERE those nodes are positioned.

To illustrate what I mean in step 3, consider an alteration of your original example of the $5\times5$ matrix (before your edit) in your question:

$ M= \left[ {\begin{array}{ccccc} 1 & \color{green}2 & \color{green}3 & 1 & 2 \\ \color{pink}3 & \color{blue}4_1 & \color{blue}4_2 & 2 & 2 \\ 1 & \color{green}2 & \color{red}3 & \color{pink}3 & 3 \\ 4 & 3 & 2 & 1 & 4 \\ 2 & 1 & 4 & 2 & 1 \\ \end{array} } \right]. $

$\color{blue}4_1$ has $7$ possible options, $3$ of which are shared by $\color{blue}4_2$ (highlighted in green), so when searching for the next element in the sequence, we have $9$ options if the element is not $4$ between $\color{blue}4_1$ and $\color{blue}4_2$.

But now make this change, moving $\color{blue}4_2$ to the bottom right corner:

$ M= \left[ {\begin{array}{ccccc} 1 & 2 & \color{pink}3 & 1 & 2 \\ \color{pink}3 & \color{blue}4_1 & 1 & 2 & 2 \\ 1 & 2 & \color{red}3 & \color{pink}3 & 3 \\ 4 & 3 & 2 & \color{blue}4_2 & 4 \\ 2 & 1 & 4 & 2 & 1 \\ \end{array} } \right]. $

In this variant, no nodes are shared between $\color{blue}4_1$ and $\color{blue}4_2$ so there are $14$ cells that have the possibility of satisfying the last sequence, which is why the number and the position of the number affect the probability.

Despite this fact that each step is not independent of the last, I have found that $$ P(c,k) = \frac{1}{c}\left(1 - \left(1-\frac{1}{c}\right)^8 \right)^{k-1} $$ is a decent approximation to the probability (tested using a MonteCarlo method).

Here's my python code for simulating the probability:

import numpy as np
def matrix(m,c,k_vals):
    """ Returns 1 if k_vals is found in the Matrix generated.
    Note that values can not be re-used

    """
    # np.random.randint excludes the top value, so it draws ints from 1 to c
    M = np.random.randint(1,c+1, size=(m,m))
    center = m // 2

    if M[center,center] != k_vals[0]:
        # The center value is not the first digit, return false
        return 0
    visited = set()
    stack = [(center,center,1)]
    # path is only collected for verifying a path during debugging
    path = [(center, center)]

    # Terminates if stack is ever empty
    while stack:
        currentx, currenty, depth = stack.pop()
        path = path[:depth-1] + [ (currentx, currenty) ]
        visited.add((currentx, currenty))
        # When the depth of our node is the length of the sequence, we have found a path
        if depth == len(k_vals):
            return 1

        for dx in [-1,0,1]:
            for dy in [-1,0,1]:
                if (currentx + dx) < 0 or (currentx + dx) >= m or (currenty + dy) < 0 or (currenty + dy) >= m:
                    # This value is out of bounds, skip it.
                    continue
                if (currentx + dx,currenty + dy) not in visited and M[currentx + dx, currenty + dy] == k_vals[depth]:
                    # We have not seen this node before, and its value is the next element we are looking for
                    if (currentx + dx, currenty + dy) not in visited and (depth + 1) <= len(k_vals):
                        stack.append((currentx + dx, currenty + dy, depth + 1))

    # If we finished the while loop then we did not find a path
    return 0


vals = []
N = 10**5
m = 3
c = 10
k_vals = [3,4]
for _ in range(N):
    vals.append(matrix(m,c,k_vals))
print("MC estimated Probability:",sum(vals)/len(vals))
print("3x3 Analytic Probability:",  1/c*(1 - (1-1/c)**8 ))
print("MxM Estimated Probability with k < m/2", 1/c*(1 - (1-1/c)**8 )**(len(k_vals)-1))

I think finding an analytic probability for anything larger than the $3\times3$ is going to prove to be very difficult