The problem:
You have a 10 x 10 grid where each cell can be either occupied (1) or unoccupied (0).
All occupied cells in the grid are part of a single "blob": a single shape defined by a set of occupied coordinates for which each occupied coordinate has at least one occupied coordinate to either its top, left, bottom, and/or right.
- (From this definition it seems the blob must occupy at least two coordinates.)
How can you quickly generate a blob that is chosen randomly from the set of all possible such blobs? If it's not possible to generate such a blob quickly, why is it not possible?
The (probably-wrong) method I'm using right now:
- I choose a random cell from the 10 x 10 grid to be the first cell of the blob.
- I then choose randomly from the set of cells neighboring the first cell to create the second cell of the blob (to make sure every blob is valid).
- I then iterate through an updated-as-it's-added-to list of all of the current cells of the blob, considering each neighboring cell once (and only once).
- A certain percentage of the time, I add this neighboring candidate cell to the blob.
- Right now I'm using 50% as the percentage.
- If this method of generating the blob can work (i.e. it doesn't have any fundamental problems with it), I would need to know what this certain percentage should be.
My gut feelings of what the solution may look like:
- It seems like the blob is generally going to occupy most of the 10 x 10 grid.
- It seems like it might be better to first choose the size of the blob, and only then choose which cells are occupied.
- I suspect that the ideal way of creating the blob if I already know how big it should be would be to choose a random cell and then to repeatedly choose a random neighbor to the existing blob and make it part of the blob.
- I suspect it may be helpful to first create an algorithm for a simpler one-dimensional grid before trying to create one for a two-dimensional grid.
Intermediate conclusions I'm coming to:
- It looks like for the simpler case of a one-dimensional grid, the number of possible blobs is equal to the $(n-1)th$ triangular number.
- See the pictures below that show a one-dimensional grid. The pattern of possible shapes as you move from $n=2$ to $n=3$ to $n=4$ is a triangular-number pattern.
Clarifications:
- Two disconnected areas of occupied cells do not count as a single blob.
- Each possible blob needs to have an equal chance of being picked.
- Two blobs of the exact same size, shape, and orientation, but different location are considered different blobs.
- There are answers below which show how to generate a randomly-picked blob but where the randomly-selected blobs can't be generated quickly. I'm looking for a way to do it quickly or a proof that it's not possible to do it quickly.
Some pictures I drew to wrap my head around the issue:
Disclaimer:
- I'm dealing with an interview question where I'm allowed to use third party help as long as I disclose it. The interview question isn't to generate the blob, but by being able to correctly generate a randomly-chosen blob I can test several different possible solutions to the interview question.
Some clarifications of Misha's answer:
Q: Why do you say "this method is definitely biased"?
A: Because we already know that the unbiased method works exactly the same way except you only accept the outcome of the random selection when the rules for a blob are applied "strictly" (i.e. you only have a blob when there's a single connected group of occupied spaces), and so if you instead just pick an occupied space and use its connected occupied spaces as the blob, you'll at the very least be biased towards smaller blobs.
Q: What do the vertical bars represent?
A: They represent size / cardinality. So $B$ is the blob itself, and $|B|$ is how many cells it contains.
Q: Why use the $∂$ character?
A: In topology it refers to the boundary of a subset of a space.
Q: How did you figure out the probability for the biased method?
A: Well, $\frac{|B|}{100}$ is just the probability that you choose one of the cells of the blob, assuming it exists. So presumably the other part is the probability that it was created. To make it easier to understand, note that in the first formula you see that the odds of getting the strict blob you want (strict because you're not tolerating occupied cells not part of your blob) on any particular attempt is $2^{-100}$, because you need all 100 of those coin-flips to go exactly the way you want them. So in this less-strict situation, you need the coin-flips for your desired occupied spaces to go the way you want, and you also need the coin-flips for the unoccupied neighboring spaces to go the way you want, so that there's no chance you'll have a neighbor end up occupied and change the blob to a different one.
Q: How did you figure out the lower bound?
A: He just took some constraints he knew to be true and then used calculus(?) to combine those functions to determine what the lowest value could be. But to rationalize it: if you look at the formula for $p(B)$ and say "I want to make $p(B)$ as small as possible", you can see that you want to make the $(\frac 12)^{|B| + |∂B|}$ term as small as possible, and the way to do that is to make $|B|+|∂B|$ as big as possible. And the biggest that sum can be is 100, where the blob and its neighboring cells take up the entire grid. Then you want to make the $\frac {|B|}{100}$ term as small as possible, and the way to do that is to figure out the smallest blob that where the blob and its neighbors fill the entire grid. But since he didn't try to find an actual blob, the number he ended up with (33) may not actually correspond to a real blob (I tried creating a blob of size 33 that occupied or bordered every cell and I couldn't do it). I think the effect of that is to make it take longer for the program to run than if the number were more accurate. If the number was perfectly accurate, then you'd have a 100% chance of accepting the rarest blobs, but with the number less accurate, you have less than a 100% chance of accepting the rarest blobs, but it should still be unbiased.
Q: If the goal of multiplying by the acceptance probability is to get rid of the $p(B)$ term, why use $p^*$ in the numerator? Why not use "1"?
A: Using a numerator greater than the probability of getting the rarest blobs (which is roughly $p^*$) will result in a non-uniform distribution, and using a numerator that's smaller will make the program take longer to run. The reason using a numerator greater than $p^*$ will result in a non-uniform distribution is that it will lead us to accept some not-rarest blobs with the same probability as the rarest blobs. If we have the numerator as $p^*$, then when we actually come across one of those rarest blobs, the $p(B)$ will equal $p^*$, and so the acceptance probability will be "1" (we'll always accept that blob). But if the numerator is larger, not only will we always accept the rarest blobs, but we will also always accept the less-rare blobs whose $p(B)$ is such that dividing the larger numerator by $p(B)$ will result in an acceptance probability of at least 1.




A definitely correct but inefficient method is to
Step 1 is equally likely to generate each blob (as well as all the non-blobs), so the result (once a blob finally is generated) is equally likely to be any blob. More precisely, let $f$ be the probability that we get something that isn't a blob. Then the probability of generating your favorite blob $B$ on the $k^{\text{th}}$ trial is $f^{k-1} \cdot \frac{1}{2^{100}}$: we failed to get a blob $k-1$ times, and then finally got all the cells in $B$ and no others. So the overall probability of getting blob $B$ with this method is $\frac{2^{-100}}{1-f}$, which doesn't depend on $B$, and therefore all blobs are equally likely.
We choose a probability of $\frac12$ so that on each trial, any blob $B$ has the same probability of $2^{-100}$ of being generated. If the probability were $p \ne \frac12$, then the probability of generating a $k$-celled blob would be $p^k (1-p)^{100-k}$, which is not uniform.
Unfortunately, it will probably take a very long time to generate a blob with this method, since a random subset of the cells usually doesn't form a connected blob.
Next, here is an actually feasible method.
First, consider the following sampling method, which is not uniform:
Even though this method is definitely biased, it has the nice feature that we know exactly the probability with which a blob $B$ is returned in a single trial. That probability is $$ p(B) = \frac{|B|}{100}\left(\frac12\right)^{|B|+|\partial B|} $$ where $|B|$ is the number of occupied cells in $B$ and $|\partial B|$ is the number of unoccupied cells bordering $B$. Moreover, we can put a lower bound on $p(B)$ for any $B$: it is $p^* = \frac{33}{100} (\frac12)^{100}$. (This is the minimum of $\frac{x}{100} (\frac12)^{x+y}$ over all $x,y$ with $x\ge2$, $y \ge 0$, $x+y \le 100$, and $y \le 2x+2$: the constraints on $|B|$ and $|\partial B|$.)
Now modify the method above to make the following method:
Now, on any given trial, the probability that $B$ is the blob we get after step 2 is is $p(B)$. But the probability that we actually return $B$ is that probability times the acceptance probability: $p(B) \cdot p^*/p(B)$, or $p^*$. So on any given trial, every blob has the same chance $p^*$ of being produced.
Experimentally, the average value of $\frac{p^*}{p(B)}$ seems to be between $0.001$ and $0.0001$, which means the average number of times steps 1-2 are repeated is not $2^{100}$ (as with the first method) but fewer than $10000$. This means we can actually run this algorithm in a few seconds as opposed to the lifetime of the universe.