Gaussian Mixture Model: What is a "universal approximator of densities"?


When looking into Gaussian Mixture Models (GMMs), I encountered multiple times the statement that "GMMs are a universal approximator of densities" (e.g., [0]).

I'm not sure whether I understand this correctly, and if so, I would need a citable source for it. The way I understand it is: given any probability density, there exists a GMM (with possibly many components) such that the density of the GMM approximates the given density to within arbitrary error.

My questions:

  • Did I understand the above correctly?
  • How is the "arbitrary error" specified?
  • What source can I cite if I want to use this fact? I could use [0], but the authors also just state this claim without proving it or providing a citation. Is this folklore?

[0] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning., p. 65


BEST ANSWER

It's a vague/fancy way to say the following simple result.

But first...some definitions and facts!:

  1. Let $\mathcal{P}(\mathbb{R}^d)\subset \mathcal{M}_f(\mathbb{R}^d)$ be the set of probability (resp. finite signed) Borel measures on $\mathbb{R}^d$ (with its Euclidean topology).
  2. Equip $\mathcal{M}_f(\mathbb{R}^d)$ with the weak$^{\star}$ topology (somewhat confusingly known in statistics as the topology of weak convergence/convergence in distribution).
  3. The point-masses $\{\delta_x\}_{x \in \mathbb{R}^d}$ span a dense subset of $\mathcal{M}_f(\mathbb{R}^d)$.
    Note: This is basically why empirical estimation of probability measures works.
  4. In this topology, given any $x \in \mathbb{R}^d$, the sequence of Gaussians $(\nu_{x,n})_{n \in \mathbb{N}}$ with mean $x$ and variance $\frac{1}{2^n}$ converges to $\delta_x$.
    Note: Here, just look at the characteristic functions / Fourier transforms.
  5. This means that the closed span of the Gaussian measures contains the closed span of the point-masses, which was $\mathcal{M}_f(\mathbb{R}^d)$.
  6. The projection $\mathcal{M}_f(\mathbb{R}^d)\setminus\{0\}\to \mathcal{P}(\mathbb{R}^d)$ taking a measure $\mu\mapsto \frac{1}{\mu_+(\mathbb{R}^d)}\mu_+$ is a continuous surjection; whence it takes dense sets to dense sets. Moreover, a direct computation shows that it takes linear combinations to convex combinations.
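
The interplay of facts 3 and 4 can be illustrated numerically: approximate a target measure by weighted point masses on a grid, then replace each point mass by a narrow Gaussian, and check that the mixture's CDF matches the target's CDF pointwise (which is what weak$^{\star}$ convergence looks like for measures on $\mathbb{R}$). This is a rough sketch, not a proof; the exponential target, the grid, and the bandwidth are arbitrary choices:

```python
import numpy as np
from math import erf, exp, sqrt

# Target distribution: Exponential(1); its CDF is 1 - e^{-t} for t >= 0.
def target_cdf(t):
    return 1.0 - exp(-t) if t >= 0 else 0.0

# Fact 3: approximate the target by a weighted sum of point masses
# delta_{x_i}, with weights taken from the density on a fine grid.
centers = np.linspace(0.005, 10.0, 1000)
weights = np.exp(-centers)
weights /= weights.sum()

# Fact 4: replace each delta_{x_i} by a narrow Gaussian N(x_i, s^2),
# which gives a Gaussian mixture with 1000 components.
s = 0.01
_erf = np.vectorize(erf)

def mixture_cdf(t):
    # Mixture CDF = weighted sum of the component Gaussian CDFs.
    z = (t - centers) / (s * sqrt(2.0))
    return float(np.sum(weights * 0.5 * (1.0 + _erf(z))))

# Weak-star convergence shows up as pointwise agreement of the CDFs
# at continuity points of the target CDF.
for t in [0.5, 1.0, 2.0, 4.0]:
    print(t, abs(mixture_cdf(t) - target_cdf(t)))
```

Shrinking the grid spacing and the component variance drives the pointwise CDF error to zero, mirroring the density argument above.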

Conclusion:

"Gaussian mixtures" (a.k.a. convex combinations of Gaussian measures) are dense in $\mathcal{P}(\mathbb{R}^d)$ for the weak$^{\star}$ topology!

Bonuses, fun, and excitement for all: try breaking the argument by showing that linear combinations of Gaussians cannot approximate (non-trivial) point masses in stronger topologies on $\mathcal{M}_f(\mathbb{R}^d)$. In other words, they are not capable of approximating all probability measures in the total variation topology (another rather standard notion of convergence of measures).

ANSWER

If somebody stumbles across this question, here are some results I was able to find after all. See also this reddit thread.

  • Linear combinations of Gaussians are dense in $L^2$, i.e., they can approximate any square-integrable function arbitrarily closely, according to [1]. Kudos to /u/LeChatTerrible on reddit for finding this.
  • I think that [2] with its "Approximation Theorem" pretty much shows what I need.
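
To make the $L^2$ claim from [1] concrete, here is a small numerical sketch: fit a linear combination of Gaussian bumps to an indicator function by least squares and measure the $L^2$ error. The target, the number of bumps, and their width are arbitrary choices for illustration; this demonstrates, but of course does not prove, the density result:

```python
import numpy as np

# Least-squares fit of a linear combination of Gaussian bumps to the
# indicator of [0, 1], a square-integrable but discontinuous target.
x = np.linspace(-2.0, 3.0, 2001)
dx = x[1] - x[0]
target = ((x >= 0.0) & (x <= 1.0)).astype(float)  # 1_{[0,1]} in L^2

# Gaussian basis: 40 bumps with fixed width s on an evenly spaced grid.
centers = np.linspace(-1.0, 2.0, 40)
s = 0.08
basis = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * s**2))

# Solve min_c || basis @ c - target ||_2 (note: coefficients may be
# negative -- this is a linear, not convex, combination).
coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
approx = basis @ coeffs

# Discrete approximation of the L^2 norm of the residual.
l2_err = float(np.sqrt(np.sum((approx - target) ** 2) * dx))
print("L2 error:", l2_err)
```

Increasing the number of bumps and shrinking their width drives the residual toward zero, as the density statement in [1] predicts.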

[1] https://arxiv.org/pdf/0805.3795.pdf

[2] Konstantinos N. Plataniotis and Dimitris Hatzinakos. 2000. Gaussian mixtures and their applications to signal processing. In Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real Time Systems, Stergios Stergiopoulos (Ed.). CRC Press, Boca Raton, Chapter 3.