How does one code the generative adversarial network loss function?


I was reading Ian Goodfellow's paper on GANs, and I read that the loss function for a GAN is:

$J^{(G)} = -J^{(D)} = \frac{1}{2} \mathbb{E}_{x \sim p_{\rm data}}\Big[ \log D(x)\Big] + \frac{1}{2} \mathbb{E}_{z} \Big[\log (1-D(G(z)))\Big]$

I saw a few examples of its implementation on MNIST, and people code the loss function like this:

d_loss_real = tf.nn.sigmoid_cross_entropy_with_logits(labels=d_labels_real, logits=d_logits_real)
d_loss_fake = tf.nn.sigmoid_cross_entropy_with_logits(labels=d_labels_fake, logits=d_logits_fake)

d_loss = tf.reduce_mean(d_loss_real + d_loss_fake)

g_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(d_logits_fake), 
        logits=d_logits_fake))

How is this code equivalent to the equation? P.S.: I don't know if this is the right place to ask, but if you know, please answer.

Best Answer

There are two separate loss functions: $\mathcal{L}_D$, which is the loss for the discriminator (written as d_loss in your example), and $\mathcal{L}_G$, which is the loss for the generator (written as g_loss in your example). Let $X$ be the space of the data, $D: X\rightarrow [0,1]\subset\mathbb{R}$ the discriminator, and $G:\mathcal{U}\rightarrow X$ the generator (with $\mathcal{U}$ some latent noise space), with $z\in \mathcal{U}$.

The job of the discriminator is to detect fake samples, meaning we want to (1) maximize the log-probability of the discriminator under the true data distribution and (2) minimize the log-probability under the generated distribution: $$ \mathcal{L}_D = \frac{1}{2}(\mathcal{L}_{D,1} + \mathcal{L}_{D,2}) = -\frac{1}{2}\mathbb{E}_{x\sim p_\text{data}}\left[ \log D(x) \right] - \frac{1}{2}\mathbb{E}_{z\sim p_G}\left[ \log(1-D(G(z))) \right] \tag{1} $$ where the first term corresponds to (1) and the second to (2).

The generator, on the other hand, wants to maximize the probability of the discriminator failing on its generated samples. Hence, we could use $\tilde{\mathcal{L}}_G = -\mathcal{L}_D$. Notice that the first term of $\mathcal{L}_D$ has no dependence on $G$, so it can be removed from the loss, giving: $$ \tilde{\mathcal{L}}_G=\frac{1}{2}\mathbb{E}_{z\sim p_G}\left[ \log(1-D(G(z))) \right] \tag{2} $$

However, it turns out that this loss function is not particularly good, because it saturates: the generator's loss gradient is smallest exactly when the generator is performing poorly, which is when it needs large gradients the most. Hence, people tend to use the following non-saturating loss instead: $$ {\mathcal{L}}_G=-\mathbb{E}_{z\sim p_G}\left[ \log D(G(z)) \right] \tag{3} $$ (the factor of $\frac{1}{2}$ from equation (2) is dropped; a constant scale does not change the minimizer).
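The saturation can be seen directly from the derivatives with respect to the discriminator's logit $x$, where $D = \sigma(x)$: the saturating loss (2) has gradient $-\sigma(x)$, while the non-saturating loss (3) has gradient $\sigma(x)-1$. A minimal pure-Python sketch (the logit value $x=-10$ is a made-up example of a sample the discriminator confidently rejects):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logit of D(G(z)) for a badly fooled discriminator:
# D(G(z)) = sigmoid(-10) ~ 4.5e-5, i.e. the generator is doing poorly.
x = -10.0

# d/dx log(1 - sigmoid(x)) = -sigmoid(x): the saturating loss (2)
grad_saturating = -sigmoid(x)

# d/dx (-log sigmoid(x)) = sigmoid(x) - 1: the non-saturating loss (3)
grad_non_saturating = sigmoid(x) - 1.0

print(abs(grad_saturating))      # tiny: almost no learning signal
print(abs(grad_non_saturating))  # near 1: strong gradient where needed
```

So when the generator is worst, loss (2) gives an almost-zero gradient while loss (3) gives a gradient of magnitude close to 1.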

Now for the code. The sigmoid_cross_entropy_with_logits(labels,logits) corresponds to: $$ L(y,x) = y(-\log \sigma (x)) + [1-y](-\log(1-\sigma(x))) \tag{4} $$ where $\sigma$ is the sigmoid function, $y$ are the labels, and $x$ are the logits. In other words, $D(x)=\sigma(x)=\hat{p}(x)$ is the estimated probability that $x$ is part of the true data distribution from the discriminator.
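As a sanity check, equation (4) can be compared against the numerically stable closed form that the TensorFlow documentation gives for sigmoid_cross_entropy_with_logits, namely $\max(x,0) - xy + \log(1+e^{-|x|})$. A pure-Python sketch (test values are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def naive_loss(y, x):
    # Equation (4): y(-log sigmoid(x)) + (1-y)(-log(1 - sigmoid(x)))
    return -y * math.log(sigmoid(x)) - (1.0 - y) * math.log(1.0 - sigmoid(x))

def stable_loss(y, x):
    # The stable form documented for tf.nn.sigmoid_cross_entropy_with_logits
    return max(x, 0.0) - x * y + math.log(1.0 + math.exp(-abs(x)))

for y in (0.0, 1.0):
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(y, x, naive_loss(y, x), stable_loss(y, x))
```

Both forms agree; TensorFlow uses the second to avoid overflow for large $|x|$.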

Consider two datasets: $X_D$, from the true distribution, and $X_G$, from the generator. Clearly $ \mathbb{E}_{x\sim p_\text{data}}[f(x)]\approx\sum_{x\in X_D} f(x)/|X_D|$ and $ \mathbb{E}_{x\sim p_G}[f(x)]\approx\sum_{x\in X_G} f(x)/|X_G|$, which are encapsulated by the reduce_mean functions.

Ok, the first line is:

d_loss_real = tf.nn.sigmoid_cross_entropy_with_logits(
                    labels=d_labels_real, 
                    logits=d_logits_real)

This is term 1 in equation (1) i.e. for $\mathcal{L}_D$. The labels are all $y\equiv 1$ and the logits are the results of the discriminator $D(x)=\sigma(x)$: $$ L_1 = \frac{1}{|X_D|}\sum_{x\in X_D} L(y,x) = \frac{1}{|X_D|}\sum_{x\in X_D} 1(-\log D(x)) + 0 = \frac{1}{|X_D|}\sum_{x\in X_D} -\log D(x) \approx \mathcal{L}_{D,1} $$

Alright, the next line:

d_loss_fake = tf.nn.sigmoid_cross_entropy_with_logits(
                    labels=d_labels_fake, 
                    logits=d_logits_fake)

This corresponds to the second term of equation (1), computed over $X_G$ with $y\equiv 0$:

\begin{align} L_2 &= \frac{1}{|X_G|}\sum_{x_g\in X_G} L(y,x_g)\\ &= \frac{1}{|X_G|}\sum_{x_g\in X_G} 0 + [1-0](-\log [1-D(x_g)])\\ &= \frac{1}{|X_G|}\sum_{x_g\in X_G} -\log(1- D(x_g))\\ &= \frac{1}{|X_G|}\sum_{z_g\in Z_G} -\log(1- D(G(z_g)))\\ &\approx \mathcal{L}_{D,2} \end{align} where $x_g = G(z_g)$ and $z_g\in Z_G=\{ z_g \;|\; G(z_g) = x_g, \;\forall x_g \in X_G \}$. Then the next line

d_loss = tf.reduce_mean(d_loss_real + d_loss_fake)

simply means that $$ \hat{L}_D = L_1 + L_2 \approx \mathcal{L}_{D,1} + \mathcal{L}_{D,2} = 2\mathcal{L}_{D}. $$ This differs from $\mathcal{L}_D$ only by the constant factor $2$ (the code omits the $\frac{1}{2}$), which does not change the optimization.
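To see concretely that tf.reduce_mean(d_loss_real + d_loss_fake) computes $L_1 + L_2$, here is a pure-Python sketch mirroring the TensorFlow calls (the logit values are made up for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical discriminator logits on a tiny batch: real samples should
# get large logits (D(x) near 1), fake samples small logits (D near 0).
d_logits_real = [2.0, 1.5, 3.0]
d_logits_fake = [-1.0, 0.5, -2.0]

# Per-example losses, as sigmoid_cross_entropy_with_logits would return:
# labels = 1 for real -> -log D(x); labels = 0 for fake -> -log(1 - D(x))
d_loss_real = [-math.log(sigmoid(x)) for x in d_logits_real]
d_loss_fake = [-math.log(1.0 - sigmoid(x)) for x in d_logits_fake]

# d_loss = tf.reduce_mean(d_loss_real + d_loss_fake): element-wise sum,
# then mean over the batch
n = len(d_logits_real)
d_loss = sum(r + f for r, f in zip(d_loss_real, d_loss_fake)) / n

# The two batch averages L1 and L2 from the derivation above
L1 = sum(d_loss_real) / n
L2 = sum(d_loss_fake) / n
print(d_loss, L1 + L2)  # identical: the code computes L1 + L2
```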

So that establishes the equivalence for the discriminator. The final line is for the generator:

g_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.ones_like(d_logits_fake), 
    logits=d_logits_fake))

Notice that the labels are all ones $y\equiv 1$, but over $X_G$. So:

\begin{align} \hat{L}_G &= \frac{1}{|X_G|}\sum_{x_g \in X_G} L(y,x_g) \\ &= \frac{1}{|X_G|}\sum_{x_g \in X_G} 1(-\log \sigma(x_g)) + [1-1](-\log(1-\sigma(x_g))) \\ &= \frac{1}{|X_G|}\sum_{x_g \in X_G} -\log D(x_g) \\ &= \frac{1}{|X_G|}\sum_{z_g \in Z_G} -\log D(G(z_g)) \\ &\approx \mathcal{L}_G \end{align} as in equation (3).
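The same collapse can be checked numerically: with $y\equiv 1$, the cross entropy of equation (4) averaged over the batch equals the Monte Carlo estimate of equation (3). A pure-Python sketch (the fake logits are hypothetical values):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical discriminator logits on a batch of generated samples
d_logits_fake = [-1.0, 0.5, -2.0]
n = len(d_logits_fake)

# g_loss as in the code: equation (4) with labels y = 1, then reduce_mean
g_loss = sum(1.0 * (-math.log(sigmoid(x)))
             + (1.0 - 1.0) * (-math.log(1.0 - sigmoid(x)))
             for x in d_logits_fake) / n

# Equation (3): -E[log D(G(z))], estimated on the same batch
L_G = -sum(math.log(sigmoid(x)) for x in d_logits_fake) / n

print(g_loss, L_G)  # the two agree
```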


TL;DR: equation (3) being used instead of equation (2) is probably the source of your confusion.


See also: [1], [2], [3]