bias searching for submatrix of higher sum in random matrices

129 Views Asked by At

This question is a follow up to the question that I asked in Stack Overflow. I now have a hint on the origin of the problem, but I still can't model it, and I was wondering if someone knows a way to do so.

The problem:

I generate 100,000 random matrices $11\times 11$ and I want to know which part of the random matrices has greater numbers. To do so I use a sliding window of $3\times 3$ that moves one cell at a time (overlapping windows, that are always inside the random matrix).

sliding window example

For each of the 100K random matrices I store the position of the sub-matrix with highest sum. I finally plot the cumulative result of the 100K highest sums.

This is the script I use, and the result I obtain:

    from random import random
    from matplotlib import pyplot as plt
    import numpy as np
    
    size = 11
    
    positions = np.zeros((size, size))
    
    for _ in range(100000):
        matrix = [[random() for _ in range(size)] for _ in range(size)]
        max_value = 0
        max_coord = 0, 0
        for beg in range(1, size - 1):
            for end in range(1, size - 1):
                suma = sum(matrix[i][j] 
                           for i in range(beg - 1, beg + 2) 
                           for j in range(end - 1, end + 2))
                if suma >= max_value:
                    max_value = suma
                    max_coord = beg, end
        positions[max_coord] += 1
    
    plt.imshow(positions[1:10,1:10], origin='lower')
    plt.colorbar()

100K simulation

The explanation I got was that the cells on the edges and, in particular in the corners visited less visits from the sliding window, and thus may have a more directed contribution to their corresponding sub-matrices.

this bellow shows how many times a cell is visited by a sliding window: enter image description here

And this matches some how the previous result.

BUT, in order to contrast better the result I continued with my simulation and went beyond the 100K (>300M), this is the result:

more_simulations

It seems that there are more categories, even in regions where the sliding window passes the same number of times.

If we plot the distribution of counts by cell:

    plt.figure(figsize=(12,8))
    vals, icoord, jcoord = zip(*sorted([(v, i, j) for i, l in enumerate(positions[0:9,0:9]) for j, v in enumerate(l)]))
    vals = [i / sum(vals) for i in vals]
    color_list = [tuple(c) for c in plt.cm.tab10(np.linspace(0,1, 10))]
    colors = [color_list[0]]
    ncolor = 0
    plt.plot(vals, color='grey', alpha=0.5, lw=1)
    for n, (a, b) in enumerate(zip(vals[:-1], vals[1::])):
        if b - a > 0.00002:
            ncolor += 1
        colors.append(color_list[ncolor])
    plt.scatter(range(len(vals)), vals, color=colors, alpha=0.75)
    plt.axhline(1 / 81, color='black', ls='--', alpha=0.7)
    plt.text(0, 1 / 81, 'expected', color='black', alpha=0.75)
    plt.xlabel('sorted cells')
    plt.ylabel('count per cell')
    plt.gca().yaxis.grid()

we obtain this:

enter image description here

I highlighted in different colours what seems to be clearly defined categories (I think that with more simulation I can find more categories).

As final dumb check I coloured previous matrix with these categories and check if the result was symmetric:

    plt.figure(figsize=(5.28, 5.3))
    plt.scatter(icoord, jcoord, c=colors, s=1000, marker='s')

enter image description here

I have now done enough simulations to correct for this effect in my analysis, but I am still very curious about if there could be some analytical solution to this problem.