Understanding why Wasserstein is weak

I am reading the Wasserstein GAN paper, and Appendix A says:

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a compact set (such as $[0, 1]^d$, the space of images). We define Prob($\mathcal{X}$) to be the space of probability measures over $\mathcal{X}$. We note $$C_b(\mathcal{X}) = \{f: \mathcal{X} \to \mathbb{R}, f \text{ is continuous and bounded}\}.$$ Note that if $f \in C_b(\mathcal{X})$, we can define $||f||_\infty = \max_{x \in \mathcal{X}} |f(x)|$, since $f$ is bounded. With this norm, the space $(C_b(\mathcal{X}), ||\cdot||_\infty)$ is a normed vector space. As for any normed vector space, we can define its dual $$C_b(\mathcal{X})^* = \{ \phi: C_b(\mathcal{X}) \to \mathbb{R}, \phi \text{ is linear and continuous} \}$$ and give it the dual norm $||\phi|| = \sup_{f \in C_b(\mathcal{X}), ||f||_\infty \le 1} |\phi(f)|$.

With these definitions, $(C_b(\mathcal{X})^*, ||\cdot ||)$ is another normed space. Now let $\mu$ be a signed measure over $\mathcal{X}$, and let us define the total variation $$||\mu||_{TV} = \sup_{A \subseteq \mathcal{X}} |\mu(A)|$$ where the supremum is taken over all Borel sets in $\mathcal{X}$. Since the total variation is a norm, then if we have $\mathbb{P}_r$ and $\mathbb{P}_\theta$ two probability distributions over $\mathcal{X}$, $$\delta(\mathbb{P}_r, \mathbb{P}_\theta):= ||\mathbb{P}_r - \mathbb{P}_\theta||_{TV}$$ is a distance in Prob($\mathcal{X}$).

We can consider $$\Phi: (\text{Prob}(\mathcal{X}), \delta) \to (C_b(\mathcal{X})^*, ||\cdot ||)$$ where $\Phi(\mathbb{P})(f) := \mathbb{E}_{x \sim \mathbb{P}}[f(x)]$. The Riesz Representation theorem (Kakutani, Theorem 10) tells us that $\Phi$ is an isometric immersion. This tells us that we can effectively consider $\text{Prob}(\mathcal{X})$ with the total variation distance as a subset of $C_b(\mathcal{X})^*$ with the norm distance.

The norm topology is very strong. Therefore, we can expect that not many functions $\theta \mapsto \mathbb{P}_\theta$ will be continuous when measuring distances between distributions with $\delta$... Now all dual spaces have a strong topology (induced by a norm), and a weak* topology. As the name suggests, the weak* topology is much weaker than the strong topology. In the case of Prob($\mathcal{X}$), the strong topology is given by the total variation distance, and the weak* topology is given by the Wasserstein distance (among others).
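
To make the two notions of closeness in the quoted passage concrete, here is a minimal sketch under an assumption that is not in the appendix text quoted above: take $\mathbb{P}_\theta$ to be the point mass at $\theta$ on the real line and compare it with the point mass at $0$. The helper names (`total_variation`, `wasserstein_1d`, `point_mass`) are hypothetical, and the 1-Wasserstein distance is computed with the one-dimensional formula $W_1 = \int |F_p - F_q|$.

```python
def total_variation(p, q):
    """sup_A |p(A) - q(A)| for discrete distributions given as {point: mass} dicts.

    For two probability distributions this supremum equals half the L1 distance
    between the mass functions.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def wasserstein_1d(p, q):
    """1-Wasserstein distance on the real line: integral of |F_p - F_q|."""
    points = sorted(set(p) | set(q))
    cdf_gap = 0.0  # running value of F_p - F_q
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        total += abs(cdf_gap) * (right - left)
    return total


def point_mass(x):
    """Illustrative assumption: P_theta is the Dirac mass at theta."""
    return {x: 1.0}


thetas = [1.0, 0.1, 0.01, 0.001]
tv_values = [total_variation(point_mass(0.0), point_mass(t)) for t in thetas]
w1_values = [wasserstein_1d(point_mass(0.0), point_mass(t)) for t in thetas]

for t, tv, w1 in zip(thetas, tv_values, w1_values):
    print(f"theta={t}: TV={tv}, W1={w1}")
```

In this toy family the total variation distance to the point mass at $0$ stays at $1$ for every $\theta \neq 0$, while the Wasserstein distance equals $|\theta|$ and shrinks with it, so $\theta \mapsto \mathbb{P}_\theta$ is continuous at $0$ for the second distance but not for the first.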

Here are my questions:

  1. I know what the Riesz Representation theorem is (though I'm not sure what the citation "[7], Theorem 10" refers to), but I'm not sure what an isometric immersion is or how the Riesz Representation theorem tells us that $\Phi$ is an isometric immersion.

  2. I'm not too familiar with the strong topology and the weak* topology. Do certain norms induce different types of topologies?

  3. The paper says that the norm topology is very strong and that not many functions $\theta \mapsto \mathbb{P}_\theta$ will be continuous when measuring distances between distributions with $\delta$. I don't quite understand why a strong topology implies that not many functions will be continuous. Shouldn't a strong topology admit more continuous functions than a weak topology?