For simplicity's sake, let's say $M$ is a two-player, constant-sum game. This gives us uniqueness of NE payoffs and interchangeability of equilibrium strategies.
We can solve for a Nash equilibrium using linear programming, or run a suitable adversarial bandit algorithm as each player's selection policy over a number of rounds and take the empirical (time-averaged) strategies as an approximation.
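To make the second approach concrete: in self-play between two no-regret learners on a zero-sum game, the average strategies converge to a Nash equilibrium. The sketch below uses the full-information Hedge/multiplicative-weights update rather than a bandit algorithm like Exp3, purely so it runs deterministically; the payoff matrix and learning rate are illustrative choices, not anything from the question.

```python
import math

def softmax(w):
    """Convert log-weights to a probability distribution, numerically stably."""
    mx = max(w)
    e = [math.exp(v - mx) for v in w]
    s = sum(e)
    return [v / s for v in e]

def hedge_selfplay(A, T=20000):
    """Self-play with Hedge on a zero-sum game.

    A[i][j] is the payoff to the row player (the column player receives -A[i][j]).
    Returns the time-averaged strategies, which approximate a Nash equilibrium.
    """
    m, n = len(A), len(A[0])
    eta = math.sqrt(math.log(max(m, n)) / T)  # illustrative learning rate
    wx, wy = [0.0] * m, [0.0] * n             # log-weights for each player
    avg_x, avg_y = [0.0] * m, [0.0] * n
    for _ in range(T):
        x, y = softmax(wx), softmax(wy)
        for i in range(m):
            avg_x[i] += x[i] / T
        for j in range(n):
            avg_y[j] += y[j] / T
        # expected payoff of each pure action against the opponent's current mix
        gx = [sum(A[i][j] * y[j] for j in range(n)) for i in range(m)]
        gy = [sum(A[i][j] * x[i] for i in range(m)) for j in range(n)]
        for i in range(m):
            wx[i] += eta * gx[i]  # row player maximizes
        for j in range(n):
            wy[j] -= eta * gy[j]  # column player minimizes
    return avg_x, avg_y
```

For the $2\times 2$ game with payoff matrix $\begin{pmatrix}3 & -1\\ -1 & 1\end{pmatrix}$, both players mix $(1/3, 2/3)$ at equilibrium and the value is $1/3$; the averaged strategies land close to that. Swapping Hedge for Exp3 only changes the feedback model (sampled actions and importance-weighted rewards instead of full expected-payoff vectors), not the convergence-of-averages argument.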
Suppose, though, that the game is very large and we do not know either player's action space in advance. The problem plays out over $t$ rounds, and in every round we observe a new action for each player. I don't yet know what to assume about this arrival process (e.g., is it stochastic over each action space?).
Is there any work on solving games in this setting? What about similar scenarios? Currently I'm using Exp3 as the bandit algorithm, and when a new action is introduced, it is assigned a logit score at the midpoint of the current highest and lowest scores. So a new action is $r$ times more likely to be selected than the least likely action, and $r$ times less likely than the most likely one. Actions are pruned when their selection probability falls below $\frac{c}{k \ln k}$, where $k$ is the number of actions. This is a request for papers and comments.
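For reference, here is a minimal sketch of the scheme just described: Exp3 with weights kept in log space, midpoint initialization for newly observed actions, and pruning below $c/(k \ln k)$. The constants `gamma` and `c` are hypothetical placeholders, as is the class name — nothing here is from a library.

```python
import math
import random

class DynamicExp3:
    """Exp3 over a growing/shrinking action set (sketch, hypothetical constants)."""

    def __init__(self, gamma=0.01, c=0.05, seed=0):
        self.gamma = gamma        # exploration rate
        self.c = c                # pruning constant
        self.logw = {}            # action -> logit score (log-weight)
        self.rng = random.Random(seed)

    def add_action(self, a):
        """New actions enter at the midpoint of the highest and lowest logits."""
        if not self.logw:
            self.logw[a] = 0.0
        else:
            lo, hi = min(self.logw.values()), max(self.logw.values())
            self.logw[a] = (lo + hi) / 2

    def probs(self):
        """Exp3 selection probabilities: softmax mixed with uniform exploration."""
        k = len(self.logw)
        mx = max(self.logw.values())
        w = {a: math.exp(v - mx) for a, v in self.logw.items()}
        s = sum(w.values())
        return {a: (1 - self.gamma) * wv / s + self.gamma / k
                for a, wv in w.items()}

    def select(self):
        """Sample an action; return it with its selection probability."""
        p = self.probs()
        r = self.rng.random()
        acc = 0.0
        for a, pa in p.items():
            acc += pa
            if r <= acc:
                return a, pa
        return a, pa  # guard against floating-point rounding

    def update(self, a, reward, p_a):
        """Standard Exp3 update with an importance-weighted reward estimate."""
        self.logw[a] += self.gamma * (reward / p_a) / len(self.logw)

    def prune(self):
        """Drop actions whose selection probability is below c / (k ln k)."""
        k = len(self.logw)
        if k < 2:
            return
        thresh = self.c / (k * math.log(k))
        p = self.probs()
        for a in [a for a, pa in p.items() if pa < thresh]:
            if len(self.logw) > 1:
                del self.logw[a]
```

One thing the sketch surfaces: Exp3's uniform exploration term guarantees every action probability at least $\gamma/k$, so the threshold $\frac{c}{k \ln k}$ only ever fires when $\gamma < c/\ln k$. With, say, $\gamma = 0.1$ and $c = 0.05$ nothing would ever be pruned; either $\gamma$ must be small, or the pruning test should be applied to the softmax mass before the exploration mixing.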