I'm reading Ian Goodfellow’s article about Generative Adversarial Networks (https://arxiv.org/pdf/1701.00160.pdf) and, on page 22, I found a sentence that I don’t understand.
It's about GAN convergence, analysed in terms of the Nash equilibrium of the game. Ian says:
“… that the game converges to its equilibrium if both players’ policies can be updated directly in function space. In practice, the players are represented with deep neural nets and updates are made in parameter space, so these results, which depend on convexity, do not apply.”
OK, so the equilibrium is searched for in parameter space, but I don't quite understand what the function space is, and what the difference is between searching in function space vs. parameter space.
I suggest referring back to the original GAN paper (pg. 5):
after which it is said:
So there are three effects in play here:
The generator and discriminator are artificial neural networks, and therefore do not have infinite capacity. This is why the first sentence of Proposition 2 stipulates that $G$ and $D$ have enough capacity. Note that continuous function spaces are infinite-dimensional, so moving in the finite parameter space of the model is not the same as moving in the full function space.
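To make this concrete, here is a minimal sketch (the one-parameter model $f_a(x) = ax$ and the target $x^2$ are my own illustrative choices, not from the paper): varying the parameter sweeps out only a one-dimensional slice of function space, so some target functions are unreachable no matter how we move in parameter space.

```python
# Illustrative toy (not from the paper): a model with one parameter,
# f_a(x) = a * x, traces a 1-D curve in function space. The target
# g(x) = x^2 lies off that curve, so no parameter setting matches it
# exactly -- the best achievable error stays bounded away from zero.

def model(a, xs):
    """Evaluate the one-parameter model f_a(x) = a * x on a grid."""
    return [a * x for x in xs]

def sq_error(a, xs, target):
    """Sum of squared errors between f_a and the target on the grid."""
    return sum((fa - t) ** 2 for fa, t in zip(model(a, xs), target))

xs = [0.0, 1.0, 2.0, 3.0]
target = [x ** 2 for x in xs]  # g(x) = x^2, not of the form a * x

# Brute-force search over a fine grid of the parameter space.
best = min(sq_error(a / 100.0, xs, target) for a in range(-1000, 1001))
print(f"best achievable error: {best:.3f}")  # strictly positive
```

A "function space" update could simply set the model equal to the target at every point; the parameter-space search above cannot, because every value of $a$ constrains the model's outputs at all inputs jointly.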
It's even worse, because $G$ only implicitly represents $p_g$. Even if $G$ is capable of generating a given image, the input latent noise $z\sim U$ may not allow it to do so. Other problems can arise for topological reasons: if the true data distribution has two separate modes with zero density in between, the vanilla GAN will be unable to represent this, since a continuous $G$ maps the connected support of $z$ to a connected set. In other words, the distributions (here, the density functions) that can be represented by $G$ are limited in the vanilla GAN formulation, no matter the capacity of $G$. Some of these limitations have been reduced in more modern GANs.
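The two-mode point can be demonstrated numerically with a toy generator of my own choosing (not from the paper): any continuous $G$ that reaches both modes must, by the intermediate value theorem, also emit samples in the zero-density region between them.

```python
# Toy illustration (my own choice of G, not from the paper): a continuous
# generator on connected latent support has connected output support.
# If G reaches both "modes" near -1 and +1, it must also produce outputs
# in the zero-density region between them.

import math

def G(z):
    """A continuous 'generator' pushing z in [-1, 1] towards {-1, +1}."""
    return math.tanh(5.0 * z)

# G hits both modes at the ends of the latent range...
assert G(-1.0) < -0.9 and G(1.0) > 0.9

# ...so bisection finds a latent z whose output lands between the modes.
lo, hi = -1.0, 1.0
for _ in range(50):
    mid = (lo + hi) / 2.0
    if G(mid) < 0.0:
        lo = mid
    else:
        hi = mid
z_between = (lo + hi) / 2.0
print(f"G({z_between:.6f}) = {G(z_between):.6f}")  # output near 0
```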
The convexity requirement. This is what is referred to in "... these results, which depend on convexity, do not apply." The fact is that the GAN value function $V(G,D)$ satisfies the following (paraphrased from the proof of prop 2 in the original GAN paper):
Since $U$ is convex, we are guaranteed to be able to reach the global optimum. But there's a problem: this convexity relies on a major assumption in prop 2, namely that "the discriminator is allowed to reach its optimum given $G$". In practice, this will not occur in most scenarios. Thus the theoretical convexity guarantee does not hold, and hence the global convergence guarantee does not hold.
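A standard textbook example (not specific to GANs) shows how simultaneous parameter updates can fail even on a convex-concave game: $\min_x \max_y xy$ has its unique equilibrium at the origin, yet gradient descent on $x$ and ascent on $y$ spirals outward rather than converging.

```python
# Classic bilinear game (not from the paper): min_x max_y x*y is convex
# in x and concave in y, with equilibrium at (0, 0). Simultaneous
# gradient descent-ascent multiplies the distance to the equilibrium by
# sqrt(1 + lr^2) every step, so the iterates spiral away from it.

def step(x, y, lr=0.1):
    """One simultaneous update: x descends on x*y, y ascends on x*y."""
    return x - lr * y, y + lr * x

x, y = 1.0, 0.0
start_norm = (x * x + y * y) ** 0.5
for _ in range(200):
    x, y = step(x, y)
end_norm = (x * x + y * y) ** 0.5
print(f"distance to equilibrium: {start_norm:.2f} -> {end_norm:.2f}")
```

This mirrors the GAN situation: convexity arguments justify "best response" (fully optimized) updates, not the small alternating parameter updates used in practice.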
For the last part of your question:
as noted above, the network only covers a small piece of "function space" (e.g., the space of smooth functions), a limitation exacerbated by the use of the implicit density representation. Searching in the parameter space is restricted to that small subset of function space, unlike the "direct updates" to $p_g$ required by the proof.