Proving that a multivariate normal distribution attains maximum entropy when the mean and covariance are given

I'm working on a homework question. The first part was:

Given an unbounded one-dimensional continuous random variable $X\in\left(-\infty,\infty\right)$ that satisfies $\left\langle X\right\rangle =\mu,\;\left\langle \left(X-\mu\right)^{2}\right\rangle =\sigma^{2}$, show that the distribution that maximizes the entropy is the Gaussian $X\sim N\left(\mu,\sigma^{2}\right)$.

I've solved this using the method of Lagrange multipliers. The next part is proving that the same holds in the case of multivariate distributions.

Generalize the previous part to a $k$-dimensional variable $X$ with given expectation value $\vec{\mu}$ and covariance matrix $\Sigma$.

I started the same way, by defining the functional I wish to optimize: $$ F\left[f_{X}\left(\overline{x}\right)\right]=H\left(X\right)+\lambda\left(1-\intop_{\mathbb{R}^{k}}f_{X}\left(\overline{x}\right)d\overline{x}\right)+\sum_{i\in\left[k\right]}\varGamma_{i}\left(\mu_{i}-\intop_{\mathbb{R}^{k}}\overline{x}_{i}f_{X}\left(\overline{x}\right)d\overline{x}\right)+\sum_{i,j\in\left[k\right]}\Lambda_{ij}\left(\Sigma_{ij}-\intop_{\mathbb{R}^{k}}\left(\mu_{i}-\overline{x}_{i}\right)\left(\mu_{j}-\overline{x}_{j}\right)f_{X}\left(\overline{x}\right)d\overline{x}\right)$$

After taking the functional derivative with respect to $f_X(\overline{x})$ and solving for the PDF, I get the following expression involving the Lagrange multipliers: $$f_{X}\left(\overline{x}\right)=\exp\left(\lambda-1\right)\exp\left(-\Gamma\cdot\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right)\right)$$ where $\Gamma\ (k\times 1),\ \Lambda\ (k\times k),\ \lambda\ (1\times 1)$ are the multipliers. I wish to show that these multipliers must take the correct values for this PDF to be a multivariate Gaussian.

This is where I get stuck. I've tried various algebraic manipulations, but the term $\exp{(-\Gamma\cdot\overline{x})}$ keeps complicating my calculations. I'm leaving out the constraints themselves since they already appear inside the functional above, and they lead to the same mess with $\exp{(-\Gamma\cdot\overline{x})}$ when I try to evaluate the integrals.

I feel like I'm missing something. Any help would be greatly appreciated!

Edit:

  1. Unfortunately, the answers provided for this exercise just say "yeah, this looks Gaussian, so we can find the parameters so it works out". Although this isn't strictly a math course, I have a very hard time accepting that answer, so the question is still highly relevant.

  2. There might be an easier way to solve this via the KL divergence, but since the first part of the question was required for this part, I would really like to see this approach through.

Another Edit:

I can carry out the required calculations; my main issue is that the constraint $$\sum_{i\in\left[k\right]}\varGamma_{i}\left(\mu_{i}-\intop_{\mathbb{R}^{k}}\overline{x}_{i}f_{X}\left(\overline{x}\right)d\overline{x}\right)$$ appears not to be needed, since when solving the equations I can set $\Gamma=0$ and obtain the correct results. I would like a rigorous explanation of why the mean is determined solely by the third constraint.

There are 4 answers below.

BEST ANSWER

Here, I first provide a detailed proof based on the relative entropy method, and then I discuss how your main issue about dropping the expectation constraint in the KKT method can be resolved. Indeed, without the expectation constraint the covariance of the distribution cannot be pinned down, and there are many other feasible normal densities satisfying the KKT conditions together with the first and third constraints (see PS3 for more details).


Let $X \sim \mathcal N (\mu,\Sigma)$ with density $f_X$, for which $$\color{blue}{H(X)=\frac{1}{2} \log\left ((2\pi e)^n\det{\Sigma} \right)}.$$ Then, one can see that

$$\log f_{X}(x) = \underbrace{-\frac{1}{2} \log((2 \pi)^n\det{\Sigma})}_{\color{blue}{A:=-H(X)+\frac{1}{2} \log(e^n)}} -\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu). \tag{1}$$
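If it helps to sanity-check the blue entropy formula numerically, here is a minimal sketch; the particular $\mu$ and $\Sigma$ below are arbitrary choices of mine, and I lean on scipy's built-in Gaussian entropy for comparison.

```python
# Numerical sanity check of H(X) = (1/2) log((2*pi*e)^n det(Sigma))
# for an arbitrary 2-D example (values chosen just for illustration).
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
n = len(mu)

closed_form = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))
scipy_value = multivariate_normal(mean=mu, cov=Sigma).entropy()

print(closed_form, scipy_value)   # the two values agree to machine precision
```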

For any other density $f_Y$ with $\mathbb E (Y)=\mu$ and $\text{cov}(Y)=\Sigma$, the relative entropy is computed as

$$D_{\text{KL}}(f_Y\|f_X)=\int_{\mathbb R^n} f_{Y}(x)\log \frac{f_{Y}(x)}{f_{X}(x)}\text{d}x=\underbrace{\int_{\mathbb R^n} f_{Y}(x)\log f_{Y}(x)\text{d}x}_{-H(Y)}-\int_{\mathbb R^n} f_{Y}(x)\log f_{X}(x)\text{d}x. $$

From (1), the second integral can be written as

$$\int_{\mathbb R^n} f_{Y}(x)\log f_{X}(x)\text{d}x =A \underbrace{\int_{\mathbb R^n} f_{Y}(x)\text{d}x}_{1}-\frac{1}{2}\underbrace{\int_{\mathbb R^n} \left( (x-\mu)^T\Sigma^{-1}(x-\mu) \right) f_{Y}(x)\text{d}x}_{\mathbb E \left ((Y-\mu)^T\Sigma^{-1}(Y-\mu) \right)=n}\\=A-\frac{1}{2}n=-H(X).$$

Hence,

$$D_{\text{KL}}(f_Y\|f_X)= -H(Y)+H(X)\ge 0, \quad\text{equivalently}\quad \color{blue}{H(X)}\ge H(Y), $$

which holds for every $Y$. This completes the proof.


PS1: Why $\color{blue}{D_{\text{KL}}(f_Y\|f_X) \ge 0}$?

$$D_{\text{KL}}(f_Y\|f_X)=\int_{\mathbb R^n} f_{Y}(x)\log \frac{f_{Y}(x)}{f_{X}(x)}\text{d}x=\mathbb E \left (\log \frac{f_{Y}(Y)}{f_{X}(Y)} \right )=\mathbb E \left (-\log \frac{f_{X}(Y)}{f_{Y}(Y)} \right ) \ge -\log \mathbb E \left (\frac{f_{X}(Y)}{f_{Y}(Y)} \right )=-\log \int_{\mathbb R^n} f_{X}(x)\text{d}x=-\log 1=0,$$

where the inequality follows from the Jensen inequality as $-\log$ is a convex function.

PS2: Why $\color{blue}{\mathbb E \left ((Y-\mu)^T\Sigma^{-1}(Y-\mu) \right)=n}$?

Let us define $Z=\Sigma^{-\frac{1}{2}}(Y-\mu)$. Then, $$\mathbb E (Z) =0.$$

$$\text{cov}(Z)=\mathbb E (ZZ^T)=\mathbb E (\Sigma^{-\frac{1}{2}}(Y-\mu)(Y-\mu)^T\Sigma^{-\frac{1}{2}})=\Sigma^{-\frac{1}{2}} \mathbb E (Y-\mu)(Y-\mu)^T \Sigma^{-\frac{1}{2}}=\Sigma^{-\frac{1}{2}}\Sigma \Sigma^{-\frac{1}{2}} =I$$

Thus, all $Z_i, i=1,\dots,n$ are uncorrelated with mean $0$ and variance $1$, which gives

$$\mathbb E \left ((Y-\mu)^T\Sigma^{-1}(Y-\mu) \right)=\mathbb E \left (Z^TZ \right)=\sum_{i=1}^n \mathbb E (Z_i^2)=n.$$

Note that $\Sigma$ is positive definite and symmetric, and it can be written as $$\Sigma=\Sigma^{\frac{1}{2}}\Sigma^{\frac{1}{2}}.$$
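A quick Monte Carlo check of PS2, if you want one: the identity only uses the first two moments of $Y$, so below $Y$ is deliberately non-Gaussian (built from standardized uniform components, an arbitrary choice of mine).

```python
# Monte Carlo check that E[(Y-mu)^T Sigma^{-1} (Y-mu)] = n for a non-Gaussian Y
# with mean mu and covariance Sigma (only the first two moments matter).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
n = len(mu)

# iid components with mean 0 and variance 1 (rescaled uniforms), colored by a Cholesky factor
L = np.linalg.cholesky(Sigma)
U = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(1_000_000, n))
Y = mu + U @ L.T                      # E[Y] = mu, cov(Y) = Sigma

quad = np.einsum('ij,jk,ik->i', Y - mu, np.linalg.inv(Sigma), Y - mu)
print(quad.mean())                    # approximately 3.0 = n
```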

PS3: On your main issue

Regarding your main issue in applying the KKT method:

I can carry out the required calculations; my main issue is that the constraint $$\sum_{i\in\left[k\right]}\varGamma_{i}\left(\mu_{i}-\intop_{\mathbb{R}^{k}}\overline{x}_{i}f_{X}\left(\overline{x}\right)d\overline{x}\right)$$ appears not to be needed, since when solving the equations I can set $\Gamma=0$ and obtain the correct results. I would like a rigorous explanation of why the mean is determined solely by the third constraint.

What you obtained is only one of several candidate solutions (the most visible one). Without the expectation constraint, the problem has multiple feasible KKT solutions (densities); in fact, following your own steps, you can find many others. The examples below show why.

Example 1. The covariance constraint:

$$\mathbb E XX^T=\Sigma +\mu\mu^T \tag{2} $$

for any $ w \in [0,1]$, can be written as

$$\mathbb E (X-\sqrt{w}\mu)(X-\sqrt{w}\mu)^T=\Sigma+(1-w)\mu\mu^T,$$

which shows $X$ with

$$\text{cov}(X)=\Sigma+(1-w)\mu\mu^T, \mathbb E (X) =\sqrt{w}\mu$$

can be another feasible solution.
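A tiny numerical illustration of Example 1 (with arbitrary $\mu$ and $\Sigma$ of my own choosing): the shifted mean/covariance pairs all reproduce the same second moment.

```python
# Check that mean sqrt(w)*mu and covariance Sigma + (1-w)*mu*mu^T give the same
# second moment E[XX^T] = Sigma + mu*mu^T for every w in [0, 1].
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

for w in (0.0, 0.3, 1.0):
    mean_w = np.sqrt(w) * mu
    cov_w = Sigma + (1 - w) * np.outer(mu, mu)
    second_moment = cov_w + np.outer(mean_w, mean_w)   # E[XX^T] = cov + mean*mean^T
    assert np.allclose(second_moment, Sigma + np.outer(mu, mu))
```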

Example 2. Note that $\Sigma$ can be decomposed as follows:

$$\Sigma=\sum_{j\in [n]}a_ja_j^T,$$

where $a_1,\dots,a_n$ are linearly independent. Thus, for any $a_i$ such that the vectors $a_j,\ j \neq i,\ j\in[n]$, together with $\mu$, are linearly independent, the covariance constraint (2) can be written as

$$\mathbb E (X-a_i)(X-a_i)^T=\sum_{j\in [n]: j \neq i} a_ja_j^T + \mu\mu^T.$$

This indicates that $X$ with $$\text{cov}(X)=\sum_{j\in [n]: j\neq i} a_ja_j^T + \mu\mu^T, \quad \mathbb E (X) =a_i$$

can be another solution.

Hence, in the absence of $\mathbb E (X)=\mu$, following your steps you can prove that any of the above distributions (and many other possible choices), with different $\text{cov}(X)$ and $\mathbb E (X)$, are feasible KKT solutions, and picking the best among them requires solving another optimization problem, discussed below.

When you drop the expectation constraint, the optimization problem reduces to the following semi-definite programming model with non-linear matrix inequalities:

$$\max \log \det W$$

subject to:

$$W=\Sigma +\mu\mu^T-yy^T \succeq 0, y\in\mathbb R^n,$$

which finds the normal density with maximum entropy in the set of normal distributions satisfying $\mathbb E XX^T=\Sigma +\mu\mu^T$ (with $\mathbb E (X)=y$ and $\text{cov}(X)=W$). Solving the above problem seems difficult at first glance, but since the objective is increasing in $W$ with respect to the order $\succeq$, the maximum is attained at $y=0$. This means that the density with maximum entropy under the fixed second moment (2), $\color{blue}{\mathbb E XX^T=\Sigma +\mu\mu^T}$, is a zero-mean normal distribution with $$\mathbb E (X)=0, \quad \text{cov}(X)=\Sigma +\mu\mu^T.$$
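A numerical illustration of this last point (again with arbitrary test values of my own): over the feasible means $y$, $\log\det W$ is largest at $y=0$, not at $y=\mu$.

```python
# Among normal densities with fixed second moment E[XX^T] = Sigma + mu*mu^T,
# the covariance is W = Sigma + mu*mu^T - y*y^T for mean y, and log det W
# (hence the entropy) is maximized at y = 0.
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
S2 = Sigma + np.outer(mu, mu)          # the fixed second moment

def logdet_cov(y):
    return np.linalg.slogdet(S2 - np.outer(y, y))[1]

print(logdet_cov(np.zeros(2)))         # largest
print(logdet_cov(0.5 * mu))            # smaller
print(logdet_cov(mu))                  # smallest of the three: the N(mu, Sigma) candidate
```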


If instead of the constraint $$\mathbb E XX^T=\Sigma +\mu\mu^T,$$ the KKT conditions are written for $$\color{blue}{\mathbb E (X-\mu)(X-\mu)^T=\Sigma},$$ as done in the OP, and the expectation constraint is dropped, we obtain

$$\Sigma= \mathbb E (X-\mu)(X-\mu)^T=\mathbb E(X-\mu_X)(X-\mu_X)^T+(\mu_X-\mu)(\mu_X-\mu)^T \tag{3}.$$

Hence, by fixing $\mu_X \in \mathbb R^n$ to any value for which $$\Sigma- (\mu_X-\mu)(\mu_X-\mu)^T \succeq 0,$$ we can follow your procedure to show that the normal distribution $$\mathcal N(\mu_X,\Sigma- (\mu_X-\mu)(\mu_X-\mu)^T)$$ is a feasible KKT solution. Now, among all of the above KKT solutions, the one with $\mu_X=\mu$ achieves the maximum entropy because $\log\det A$ is an increasing function of $A$ with respect to $\succeq$. Interestingly (unlike the case where the constraint $\mathbb E XX^T=\Sigma +\mu\mu^T$ is used), the optimal solution obtained by considering $\mathbb E (X-\mu)(X-\mu)^T=\Sigma$ is $\mathcal N(\mu,\Sigma)$ (as opposed to $\mathcal N(0,\Sigma+\mu\mu^T)$). However, if the expectation constraint is also imposed, in both cases the only feasible KKT point is $\mathcal N(\mu,\Sigma)$, which is also optimal. We can now make the following concluding statement:

If only the first and third constraints in the OP are imposed, the nominal mean $\mu$ appearing in the third constraint (tagged (3) in my answer) turns out to be the mean of the optimal distribution, because that candidate outperforms the other feasible densities. This observation does not generalize, however: if (3) is replaced with the equivalent constraint (2), the optimal density has zero mean.
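And the analogous check for the covariance-form constraint (same arbitrary test values): among the candidates $\mathcal N(\mu_X,\Sigma-(\mu_X-\mu)(\mu_X-\mu)^T)$, the entropy is largest at $\mu_X=\mu$.

```python
# For the covariance-form constraint E[(X-mu)(X-mu)^T] = Sigma with the mean
# constraint dropped, the feasible normal candidates are
# N(m, Sigma - (m-mu)(m-mu)^T); their entropy is monotone in log det of the
# covariance, which is largest at m = mu.
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def logdet_cov(m):
    return np.linalg.slogdet(Sigma - np.outer(m - mu, m - mu))[1]

print(logdet_cov(mu))                           # largest: the N(mu, Sigma) candidate
print(logdet_cov(mu + np.array([0.5, 0.0])))    # strictly smaller
```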

ANSWER

Let's compute the constrained functionals in terms of the dual variables, which might clarify things. The idea is almost completely illustrated via the computations for

$$ A(\lambda, \Gamma, \Lambda) = \exp(\lambda-1) \int \exp( -\Gamma^\top x - (x-\mu)^\top \Lambda (x-\mu) )\mathrm{d}x. $$

Of course, this only makes sense if $\Lambda$ is positive definite, since otherwise the exponential blows up over the integration domain. If $\Lambda$ is positive definite, then there exists a positive definite matrix $\Lambda^{1/2}$ such that $(\Lambda^{1/2})^2 = \Lambda$. Substitute $x = \Lambda^{-1/2}y + \mu$ to get

$$ A \cdot \exp(1-\lambda) =\exp(-\Gamma^\top \mu) \int (\det \Lambda^{-1/2}) \exp( -\Gamma^\top \Lambda^{-1/2} y - y^\top y)\mathrm{d}y.$$ Let me write $v = -\Lambda^{-1/2} \Gamma/2$. Then we have \begin{align} A \cdot \exp(1-\lambda) \exp(\Gamma^\top \mu) \sqrt{\det \Lambda} &= \int \exp( 2v^\top y - y^\top y)\mathrm{d}y\\ &= \int \exp( - (y-v)^\top (y-v) + \|v\|^2)\mathrm{d}y\\ &= \exp(\|v\|^2) \int \exp(-\|z\|^2)\mathrm{d}z\\ &= (\pi)^{k/2} \exp(\|v\|^2),\end{align} where I've used the substitution $y = z + v$ and the final equality is just the Gaussian integral. Translating everything back, we get $$ A(\lambda, \Gamma, \Lambda) = (\pi)^{k/2} \exp(\lambda - 1 - \Gamma^\top \mu + \|\Lambda^{-1/2} \Gamma\|^2/4) (\det \Lambda)^{-1/2}. $$
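As a sanity check of this closed form (not needed for the argument), one can compare it with direct numerical integration in a small example; the values of $\mu$, $\Lambda$, $\Gamma$ below are arbitrary, and I take $\lambda = 1$ so that $\exp(\lambda-1)=1$. Note that $\|\Lambda^{-1/2}\Gamma\|^2 = \Gamma^\top\Lambda^{-1}\Gamma$.

```python
# Compare the closed form of A(lambda, Gamma, Lambda) with direct 2-D quadrature
# (lambda = 1, so the prefactor exp(lambda - 1) is 1; test values are arbitrary).
import numpy as np
from scipy.integrate import dblquad

mu = np.array([1.0, -2.0])
Lam = np.array([[1.5, 0.2],
                [0.2, 0.8]])            # symmetric positive definite
Gamma = np.array([0.7, -0.3])
k = 2

def integrand(y, x):
    v = np.array([x, y])
    return np.exp(-Gamma @ v - (v - mu) @ Lam @ (v - mu))

numeric, _ = dblquad(integrand, -20, 20, lambda x: -20, lambda x: 20)

closed = (np.pi ** (k / 2)
          * np.exp(-Gamma @ mu + Gamma @ np.linalg.inv(Lam) @ Gamma / 4)
          / np.sqrt(np.linalg.det(Lam)))

print(numeric, closed)                  # agree to quadrature accuracy
```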

Now similarly define $$ B(\lambda, \Gamma, \Lambda) := \exp(\lambda-1) \int x\exp( -\Gamma^\top x - (x-\mu)^\top \Lambda (x-\mu) )\mathrm{d}x. $$

Then, carrying out exactly the same manipulations, we end up with $$B(\lambda, \Gamma, \Lambda) = A(\lambda, \Gamma, \Lambda) \cdot \int (\pi)^{-k/2} (\Lambda^{-1/2} y + \mu) \exp(- \|y + \Lambda^{-1/2}\Gamma/2\|^2 ) \mathrm{d}y\\ = A(\lambda, \Gamma, \Lambda) \cdot (\mu - \Lambda^{-1}\Gamma/2). $$

Further, for $$ C(\lambda, \Gamma, \Lambda) := \exp(\lambda-1) \int (x-\mu)(x-\mu)^\top \exp( -\Gamma^\top x - (x-\mu)^\top \Lambda (x-\mu) )\mathrm{d}x, $$

we end up with $$C = A(\lambda, \Gamma, \Lambda) \int (\pi)^{-k/2} \Lambda^{-1/2} yy^\top \Lambda^{-1/2} \exp( - \|y + \Lambda^{-1/2} \Gamma/2\|^2) \mathrm{d}y \\ = A(\lambda, \Gamma, \Lambda) \left(\Lambda^{-1}/2 +\Lambda^{-1} \Gamma \Gamma^\top \Lambda^{-1}/4 \right),$$ where I've used the usual decomposition $\mathrm{Cov}(X) = \mathbb{E}[XX^\top] -\mathbb{E}[X] \mathbb{E}[X]^\top.$


Now, the constraints are that $A = 1, B = \mu, C = \Sigma.$ Plugging $A = 1$ into $B,C$, this yields $$ \mu - \Lambda^{-1} \Gamma/2 = \mu, \qquad \Lambda^{-1}/2 + \Lambda^{-1} \Gamma \Gamma^\top \Lambda^{-1}/4 = \Sigma,$$ and the solution over positive definite $\Lambda$ is exactly $\Gamma = 0, \Lambda = \Sigma^{-1}/2$, as expected. This in turn gives $$ A(\lambda, 0, \Sigma^{-1}/2) = \pi^{k/2}\exp(\lambda - 1) (\det (\Sigma^{-1}/2))^{-1/2} = \exp(\lambda - 1) \cdot (2\pi)^{k/2} (\det \Sigma)^{1/2},$$ and enforcing that this equals $1$ yields the appropriate normalisation for the Gaussian.
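To close the loop, here is a small sketch (arbitrary $\mu$, $\Sigma$ of my own choosing; scipy is used only for comparison) checking that these multiplier values reproduce the $\mathcal N(\mu,\Sigma)$ density:

```python
# Plug Gamma = 0, Lambda = Sigma^{-1}/2 and exp(lambda - 1) = ((2*pi)^k det Sigma)^{-1/2}
# back into f(x) = exp(lambda - 1) * exp(-Gamma.x - (x - mu)^T Lambda (x - mu))
# and compare with the N(mu, Sigma) density.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
k = len(mu)
Lam = np.linalg.inv(Sigma) / 2
norm_const = ((2 * np.pi) ** k * np.linalg.det(Sigma)) ** -0.5   # exp(lambda - 1)

x = np.array([0.3, -1.2])
f = norm_const * np.exp(-(x - mu) @ Lam @ (x - mu))
print(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x))         # identical values
```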

ANSWER

'"Badly enough the answers for this exercise just say - "yeah this looks gaussian, so we can find the parameters so it works out"' -- this is absolutely a correct and rigorous thing to say, though it can maybe be worded/explained better. If you have a probability distribution of the form $$f_{X}\left(\overline{x}\right)=\exp\left(\lambda-1\right)\exp\left(-\Gamma\cdot\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right)\right)$$

or, more generally,

$$f_{X}\left(\overline{x}\right)=C\exp\left(-\vec{u}\cdot\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right)\right)$$

that satisfies $\left\langle \vec{X}\right\rangle =\vec{\mu},\;\left\langle \left(\vec{X}-\vec{\mu}\right)\left(\vec{X}-\vec{\mu}\right)^{T}\right\rangle =\Sigma$, then it is the Gaussian with mean $\vec{\mu}$ and covariance $\Sigma$.

The point is that $\left(-\vec{u}\cdot\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right)\right)$ is an (affine, or inhomogeneous) quadratic form, so in order for this to be a distribution $\Lambda$ has to be positive definite, and then your distribution is a multivariate Gaussian. Since it is a distribution, $C$ must take the unique value that normalizes the total integral of the density to 1. Furthermore, a Gaussian is uniquely specified by its mean vector and covariance matrix. So if you know that your Gaussian has mean $\vec{\mu}$ and covariance $\Sigma$, then it must be $N(\vec{\mu}, \Sigma)$. (In your case this implies $\Lambda = \Sigma^{-1}/2$ and $\Gamma=\vec{0}$, but that's kind of beside the point.)
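Since the $\exp{(-\Gamma\cdot\overline{x})}$ term is what kept tripping you up, here is a short numerical sketch of the completing-the-square identity behind "it is still a Gaussian" (the specific $\mu$, $\Lambda$, $\Gamma$ are arbitrary test values of mine): the linear term only shifts the centre of the quadratic form.

```python
# Completing the square: -Gamma.x - (x - mu)^T Lam (x - mu)
#                      = -(x - m)^T Lam (x - m) + c
# with m = mu - Lam^{-1} Gamma / 2 and c = m^T Lam m - mu^T Lam mu.
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Lam = np.array([[1.5, 0.2],
                [0.2, 0.8]])            # symmetric positive definite
Gamma = np.array([0.7, -0.3])

m = mu - np.linalg.inv(Lam) @ Gamma / 2
c = m @ Lam @ m - mu @ Lam @ mu

for _ in range(3):
    x = rng.normal(size=2)
    lhs = -Gamma @ x - (x - mu) @ Lam @ (x - mu)
    rhs = -(x - m) @ Lam @ (x - m) + c
    assert np.isclose(lhs, rhs)
```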

ANSWER

The proof of this can be made very simple. You want to minimize the following quantity \begin{align} H[\rho] = \int_{\mathbb{R}^d} \rho \log \rho \, \mathrm{d}x \, , \end{align} where $\rho$ is the p.d.f/law of the random variable $X$, subject to the constraint that $\rho$ has mean $\mu$ and covariance $\Sigma$. Denote by $N_{\Sigma,\mu}$ the corresponding Gaussian p.d.f. Let $\rho$ be a p.d.f which satisfies this constraint. We then have that \begin{align} H[\rho]= & \int_{\mathbb{R}^d} \rho \log \frac{\rho}{N_{\Sigma,\mu}} \, \mathrm{d}x +\int \rho \log N_{\Sigma,\mu} \, \mathrm{d}x \\ =&\int_{\mathbb{R}^d} \rho \log \frac{\rho}{N_{\Sigma,\mu}} \, \mathrm{d}x -\frac{d}{2} -\frac 12 \log (2\pi)^d|\Sigma| \qquad (*) \,. \end{align} The first term in (*) is the relative entropy between $ \rho$ and $N_{\Sigma,\mu}$ and the other terms are just fixed constants. The relative entropy (or what someone in the comments called the KL divergence) is always non-negative, so if you could make it zero you would have minimized the above quantity and you would be done. You can do this by choosing $\rho =N_{\Sigma,\mu}$.

Furthermore, here's a short proof that the relative entropy is non-negative. Using Jensen's inequality (since $x \log x$ is convex), we have \begin{align} \int_{\mathbb{R}^d} \rho \log \frac{\rho}{N_{\Sigma,\mu}} \, \mathrm{d}x = &\int_{\mathbb{R}^d} \frac{\rho}{N_{\Sigma,\mu}} \log \left(\frac{\rho}{N_{\Sigma,\mu}} \right) N_{\Sigma,\mu} \, \mathrm{d}x \\ \geq & 1 \cdot \log 1 =0 \, . \end{align}
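As a concrete (non-rigorous) illustration of this inequality, one can compare the Gaussian with another density that matches the mean and covariance; below, independent Laplace marginals scaled to unit variance (my own choice of example) are compared against the standard normal.

```python
# A non-Gaussian density with the same mean (0) and covariance (I) has smaller
# differential entropy than the Gaussian: iid unit-variance Laplace marginals
# versus the standard normal in d dimensions.
import numpy as np

d = 3
b = 1 / np.sqrt(2)                            # Laplace scale: variance 2*b^2 = 1
H_laplace = d * (1 + np.log(2 * b))           # entropy of the product of Laplace densities
H_gauss = 0.5 * np.log((2 * np.pi * np.e) ** d)   # Gaussian entropy with det(I) = 1

print(H_laplace, H_gauss)                     # ~4.04 < ~4.26
```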

As a reference, look at Proposition 9.12 of the following book: https://link.springer.com/book/10.1007/978-1-4612-4286-4. It contains the result you want but in more generality. Just a note: the authors of this book are trying to maximize the entropy because they have flipped its sign (this is the physicist's definition of entropy).