I want to write a procedural generation algorithm that takes in sets of attributes, and creates an internal model of how often attributes occur together to generate new sets, not all of which may have been in the original list of sets used to seed it. However, I'm at a bit of a loss for how to go about this.
For example, we can take the starting sets $\{A, B, C\}$, $\{A\}$, $\{A, B\}$ $\{B, C\}$. I can make some observations about this list:
- Any set with $A$ has a 75% chance of also having $B$
- Any set with $B$ has a 66% chance of also having $A$
- A set has a 50% chance of having $A$ and $B$
- A set has a 0% chance of having neither $A$ nor $B$
- Any set with $A$ or $B$ has a 50% chance of having both
etc.
In particular the 5th observation here is of note to me, and it is ones of those form that I find most important. So just to focus in on what the result could look like, I can take that 5th observation and apply it to other pairs too:
- Any set with $A$ or $B$ has a 50% chance of having both
- Any set with $A$ or $C$ has a 25% chance of having both
- Any set with $B$ or $C$ has a ~67% chance of having both
I could could generate a set like $\{B\}$ from this, which was not in the original list of sets. Based on the data above, this combination should have around a ~17% chance of occurring among all sets that contain a $B$.
What I'm ultimately looking for is to take some or all of these observations on board and randomly generate a list of new (non-empty) sets of which broadly similar observations can be made (i.e. the probability of the relevant observations is the same in the source as in the generated lists).
What I don't know is which of those observations are useful to me and which are not. Mainly what I'd like to avoid is that my algorithm can only reproduce the sets it was seeded with (having too many constraints on generation). I assume this means that there might be multiple probability distributions that are correct for any given seed, so there could be multiple algorithms that are correct.
First of all, is there an existing algorithm that does what I need? That would be the most convenient outcome.
If that's not the case, does anyone know how an algorithm for this might look, or what kind of approach I should take for this? I think that perhaps my main issue is that I just don't how to proceed after producing these observations; how do I get from here to an algorithm that fulfils the probability requirements these observations indicate?
This problem seems to have some multi-dimensional characteristics and it's breaking my brain a bit at the moment because I don't really know how to conceptualise a solution.
Lastly, I'd like to avoid an algorithm with bad time complexity as the amount of distinct attributes I need to work with could get rather large. This is the least important of all (something that works is more important than being fast) but it's a consideration in case there are multiple approaches that work for this.