I'm working on a homework question. The first part was:
Given an unbounded one-dimensional continuous random variable $X\in\left(-\infty,\infty\right)$ that satisfies $\left\langle X\right\rangle =\mu,\;\left\langle \left(X-\mu\right)^{2}\right\rangle =\sigma^{2}$, show that the distribution that maximizes the entropy is Gaussian, $X\sim N\left(\mu,\sigma^{2}\right)$.
I've solved this using Lagrange multipliers method. The next part is proving the same holds in the case of multivariate distributions.
Generalize the previous part to a $k$ dimensional variable $X$ with given expectation value $\vec{\mu}$ and covariance matrix $\Sigma$.
I started the same way, defining the functional I wish to optimize: $$ F\left[f_{X}\left(\overline{x}\right)\right]=H\left(X\right)+\lambda\left(1-\intop_{\mathbb{R}^{k}}f_{X}\left(\overline{x}\right)d\overline{x}\right)+\sum_{i\in\left[k\right]}\varGamma_{i}\left(\mu_{i}-\intop_{\mathbb{R}^{k}}\overline{x}_{i}f_{X}\left(\overline{x}\right)d\overline{x}\right)+\sum_{i,j\in\left[k\right]}\Lambda_{ij}\left(\Sigma_{ij}-\intop_{\mathbb{R}^{k}}\left(\mu_{i}-\overline{x}_{i}\right)\left(\mu_{j}-\overline{x}_{j}\right)f_{X}\left(\overline{x}\right)d\overline{x}\right)$$
After taking the functional derivative with respect to $f_X(\overline{x})$ and extracting the PDF, I get the following expression involving the Lagrange multipliers: $$f_{X}\left(\overline{x}\right)=\exp\left(\lambda-1\right)\exp\left(-\Gamma\cdot\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right)\right)$$ where $\Gamma$ ($k\times 1$), $\Lambda$ ($k\times k$), and $\lambda$ ($1\times 1$) are the multipliers. I wish to show that these multipliers must take the correct values for this PDF to be a multivariate Gaussian.
This is where I get stuck: I've tried various algebraic manipulations, but the term $\exp{(-\Gamma\cdot\overline{x})}$ keeps derailing my calculations. I'm leaving out the constraint equations themselves, since they appear inside the optimization functional above, and they leave me stuck with $\exp{(-\Gamma\cdot\overline{x})}$ when I try to evaluate the integrals.
I feel like I'm missing something. I would really appreciate any help!
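For reference, the troublesome linear term can be absorbed by completing the square (a sketch, writing $\Gamma\cdot\overline{x}$ as $\Gamma^{T}\overline{x}$ and assuming $\Lambda$ is symmetric and invertible, as the second-order conditions require):
$$-\Gamma^{T}\overline{x}-\left(\vec{\mu}-\overline{x}\right)^{T}\Lambda\left(\vec{\mu}-\overline{x}\right)=-\left(\overline{x}-m\right)^{T}\Lambda\left(\overline{x}-m\right)+\tfrac{1}{4}\Gamma^{T}\Lambda^{-1}\Gamma-\Gamma^{T}\vec{\mu},\qquad m=\vec{\mu}-\tfrac{1}{2}\Lambda^{-1}\Gamma,$$
so for every $\Gamma$ the stationary density is a Gaussian with mean $m$; imposing the mean constraint then forces $\Gamma=0$, and the covariance constraint gives $\Lambda=\frac{1}{2}\Sigma^{-1}$.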
Edit:
Unfortunately, the provided answers for this exercise just say "yeah, this looks Gaussian, so we can find the parameters so it works out". Although this isn't strictly a math course, I have a very hard time accepting this answer, so the question remains highly relevant.
There might be an easier way to solve this via KL divergence, but since the first part of the question was required for the next part I would really like to see this through.
Another Edit:
I can carry out the calculations; my main issue is that the constraint $$\sum_{i\in\left[k\right]}\varGamma_{i}\left(\mu_{i}-\intop_{\mathbb{R}^{k}}\overline{x}_{i}f_{X}\left(\overline{x}\right)d\overline{x}\right)$$ appears not to be needed, since when solving the equations I can set $\Gamma=0$ and still obtain the correct results. I would like a rigorous explanation of why the mean is determined solely by the third (covariance) constraint.
Here, I first provide a proof with the required details based on the relative entropy method, and then I discuss how your main issue, dropping the expectation constraint when using the KKT method, can be resolved. Indeed, without the expectation constraint the mean and covariance of the distribution are not pinned down, and there are many other feasible normal densities satisfying the KKT conditions together with the first and third constraints (see PS3 for more details).
Let $X \sim \mathcal N (\mu,\Sigma)$ with density $f_X$, for which $$\color{blue}{H(X)=\frac{1}{2} \log\left ((2\pi e)^n\det{\Sigma} \right)}.$$ Then, one can see that
$$\log f_{X}(x) = \underbrace{-\frac{1}{2} \log((2 \pi)^n\det{\Sigma})}_{\color{blue}{A:=-H(X)+\frac{1}{2} \log(e^n)}} -\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu). \tag{1}$$
For any other density $f_Y$ with $\mathbb E (Y)=\mu$ and $\text{cov}(Y)=\Sigma$, the relative entropy is computed as
$$D_{\text{KL}}(f_Y\|f_X)=\int_{\mathbb R^n} f_{Y}(x)\log \frac{f_{Y}(x)}{f_{X}(x)}\text{d}x=\underbrace{\int_{\mathbb R^n} f_{Y}(x)\log f_{Y}(x)\text{d}x}_{-H(Y)}-\int_{\mathbb R^n} f_{Y}(x)\log f_{X}(x)\text{d}x. $$
From (1), the second integral can be written as
$$\int_{\mathbb R^n} f_{Y}(x)\log f_{X}(x)\text{d}x =A \underbrace{\int_{\mathbb R^n} f_{Y}(x)\text{d}x}_{1}-\frac{1}{2}\underbrace{\int_{\mathbb R^n} \left( (x-\mu)^T\Sigma^{-1}(x-\mu) \right) f_{Y}(x)\text{d}x}_{\mathbb E \left ((Y-\mu)^T\Sigma^{-1}(Y-\mu) \right)=n}\\=A-\frac{1}{2}n=-H(X).$$
Hence,
$$D_{\text{KL}}(f_Y\|f_X)= -H(Y)+H(X)\ge 0, \quad\text{i.e.}\quad \color{blue}{H(X)}\ge H(Y), $$
which holds for every $Y$. This completes the proof.
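As a numerical sanity check (a sketch, not part of the proof; the particular $\mu$, $\sigma$, and the Laplace comparison density are arbitrary choices): in one dimension, a Laplace density with the same mean and variance as a Gaussian has strictly smaller entropy, and a Monte Carlo estimate of $-\mathbb E\log f_X(X)$ reproduces the closed form $\frac{1}{2}\log(2\pi e\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

# Closed-form entropies (in nats): Gaussian vs. a Laplace density with the
# same mean and variance.  Laplace(b) has variance 2 b^2, so b = sigma/sqrt(2)
# and its entropy is 1 + log(2 b).
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
h_laplace = 1 + np.log(np.sqrt(2) * sigma)

# Monte Carlo estimate of H(X) = -E[log f_X(X)] for the Gaussian.
x = rng.normal(mu, sigma, size=200_000)
log_f = -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
h_mc = -log_f.mean()

print(h_gauss, h_laplace, h_mc)  # the Gaussian entropy is the largest
```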
PS1: Why $\color{blue}{D_{\text{KL}}(f_Y\|f_X) \ge 0}$?
$$D_{\text{KL}}(f_Y\|f_X)=\int_{\mathbb R^n} f_{Y}(x)\log \frac{f_{Y}(x)}{f_{X}(x)}\text{d}x=\mathbb E \left (\log \frac{f_{Y}(Y)}{f_{X}(Y)} \right )=\mathbb E \left (-\log \frac{f_{X}(Y)}{f_{Y}(Y)} \right ) \ge -\log \mathbb E \left (\frac{f_{X}(Y)}{f_{Y}(Y)} \right )=-\log \int_{\mathbb R^n} f_{X}(x)\text{d}x=-\log 1=0,$$
where the inequality follows from the Jensen inequality as $-\log$ is a convex function.
PS2: Why $\color{blue}{\mathbb E \left ((Y-\mu)^T\Sigma^{-1}(Y-\mu) \right)=n}$?
Let us define $Z=\Sigma^{-\frac{1}{2}}(Y-\mu)$. Then, $$\mathbb E (Z) =0.$$
$$\text{cov}(Z)=\mathbb E (ZZ^T)=\mathbb E \left(\Sigma^{-\frac{1}{2}}(Y-\mu)(Y-\mu)^T\Sigma^{-\frac{1}{2}}\right)=\Sigma^{-\frac{1}{2}}\, \mathbb E \left((Y-\mu)(Y-\mu)^T\right) \Sigma^{-\frac{1}{2}}=\Sigma^{-\frac{1}{2}}\Sigma \Sigma^{-\frac{1}{2}} =I$$
Thus, all $Z_i, i=1,\dots,n$ are uncorrelated with mean $0$ and variance $1$, which gives
$$\mathbb E \left ((Y-\mu)^T\Sigma^{-1}(Y-\mu) \right)=\mathbb E \left (Z^TZ \right)=\sum_{i=1}^n \mathbb E (Z_i^2)=n.$$
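The identity holds for any $Y$ with the given mean and covariance, Gaussian or not, which a quick Monte Carlo check illustrates (a sketch; the scaled-uniform construction below is just one convenient non-Gaussian choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # a generic SPD covariance
L = np.linalg.cholesky(Sigma)

# A deliberately non-Gaussian Y with E(Y) = mu and cov(Y) = Sigma:
# uniform(-sqrt(3), sqrt(3)) components have mean 0 and variance 1.
u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(500_000, n))
Y = mu + u @ L.T

# Sample average of (Y - mu)^T Sigma^{-1} (Y - mu)
d = Y - mu
quad = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
print(quad.mean())   # close to n = 3
```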
Note that $\Sigma$ is positive definite and symmetric, and it can be written as $$\Sigma=\Sigma^{\frac{1}{2}}\Sigma^{\frac{1}{2}}.$$
PS3: On your main issue
Regarding your main issue in applying the KKT method:
You already obtained one of the possible candidate solutions (the most obvious one). Without the expectation constraint, the problem has multiple feasible KKT solutions (densities): following your own steps, you can find many solutions other than the one you obtained. The examples below show why other solutions exist once the expectation constraint is dropped.
Example 1. The covariance constraint
$$\mathbb E XX^T=\Sigma +\mu\mu^T \tag{1} $$
can, for any $w \in [0,1]$, be rewritten as
$$\mathbb E (X-\sqrt{w}\mu)(X-\sqrt{w}\mu)^T=\Sigma+(1-w)\mu\mu^T$$
once $\mathbb E (X) =\sqrt{w}\mu$ is assumed, which shows that any $X$ with
$$\text{cov}(X)=\Sigma+(1-w)\mu\mu^T, \qquad \mathbb E (X) =\sqrt{w}\mu$$
satisfies (1) and is therefore another feasible solution.
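The algebra behind Example 1 can be checked numerically (a sketch; the particular $\Sigma$ and $\mu$ are arbitrary), using $\mathbb E XX^T = \text{cov}(X) + \mathbb E(X)\mathbb E(X)^T$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
mu = rng.standard_normal(n)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + np.eye(n)              # a generic SPD matrix

for w in (0.0, 0.3, 1.0):
    mean = np.sqrt(w) * mu
    cov = Sigma + (1 - w) * np.outer(mu, mu)
    second_moment = cov + np.outer(mean, mean)
    # E[XX^T] = cov(X) + E(X)E(X)^T recovers Sigma + mu mu^T for every w
    assert np.allclose(second_moment, Sigma + np.outer(mu, mu))
print("all w pass")
```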
Example 2. To provide another example, note that $\Sigma$ can be decomposed as follows:
$$\Sigma=\sum_{j\in [n]}a_ja_j^T,$$
where $a_1,\dots,a_n$ are linearly independent. Then, for any $i$ such that $\{a_j : j \neq i, j\in[n]\}$ together with $\mu$ is linearly independent, the covariance constraint (1) can be written as
$$\mathbb E (X-a_i)(X-a_i)^T=\sum_{j\in [n]: j \neq i} a_ja_j^T + \mu\mu^T.$$
This indicates that $X$ with $$\text{cov}(X)=\sum_{j\in [n]: j\neq i} a_ja_j^T + \mu\mu^T, \qquad \mathbb E (X) =a_i$$
can be another solution.
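Example 2 can also be checked numerically (a sketch; taking the $a_j$ to be the columns of the Cholesky factor of $\Sigma$ is one convenient decomposition):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
mu = rng.standard_normal(n)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + np.eye(n)              # a generic SPD matrix

# Sigma = sum_j a_j a_j^T with a_j the columns of its Cholesky factor.
L = np.linalg.cholesky(Sigma)
a = [L[:, j] for j in range(n)]
assert np.allclose(sum(np.outer(v, v) for v in a), Sigma)

i = 0                                    # shift the mean onto a_i
mean = a[i]
cov = sum(np.outer(a[j], a[j]) for j in range(n) if j != i) + np.outer(mu, mu)
# E[XX^T] = cov + mean mean^T again equals Sigma + mu mu^T
assert np.allclose(cov + np.outer(mean, mean), Sigma + np.outer(mu, mu))
print("example 2 passes")
```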
Hence, in the absence of $\mathbb E (X)=\mu$, following your steps you can show that any of the above distributions (and many other possible choices), with different $\text{cov}(X)$ and $\mathbb E (X)$, is a feasible KKT solution; picking the best among them requires solving another optimization problem, discussed below.
When you drop the expectation constraint, the optimization problem reduces to the following semi-definite programming model with non-linear matrix inequalities:
$$\max \log \det W$$
subject to:
$$W=\Sigma +\mu\mu^T-yy^T \succeq 0, y\in\mathbb R^n,$$
which finds the normal density with maximum entropy in the set of normal distributions satisfying $\mathbb E XX^T=\Sigma +\mu\mu^T$ ($\mathbb E (X)=y$ and $\text{cov}(X)=W$). Solving the above problem seems difficult at first glance, but as the objective is increasing in $W$ with respect to the order $\succeq$, the upper bound is obtained by setting $y=0$. This means that the density with maximum entropy with fixed second moments (1), $\color{blue}{\mathbb E XX^T=\Sigma +\mu\mu^T}$, is a zero-mean normal distribution with $$\mathbb E (X)=0, \text{cov}(X)=\Sigma +\mu\mu^T.$$
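The monotonicity step can be illustrated numerically (a sketch; the scaling used below to keep $W \succeq 0$, namely $y^T M^{-1} y \le 1$ with $M = \Sigma + \mu\mu^T$, is one convenient parametrization of the feasible set):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
mu = rng.standard_normal(n)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + np.eye(n)
M = Sigma + np.outer(mu, mu)             # the fixed second moment E[XX^T]

best = np.linalg.slogdet(M)[1]           # objective log det W at y = 0
for _ in range(100):
    y = rng.standard_normal(n)
    # scale y so that W = M - y y^T stays positive semidefinite
    y *= rng.uniform(0, 1) / np.sqrt(y @ np.linalg.inv(M) @ y)
    W = M - np.outer(y, y)
    assert np.linalg.slogdet(W)[1] <= best + 1e-9
print("y = 0 maximizes log det W")
```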
If instead of the constraint $$\mathbb E XX^T=\Sigma +\mu\mu^T,$$ the KKT conditions are written for $$\color{blue}{\mathbb E (X-\mu)(X-\mu)^T=\Sigma},$$ as done in the OP, and the expectation constraint is dropped, we obtain
$$\Sigma= \mathbb E (X-\mu)(X-\mu)^T=\mathbb E(X-\mu_X)(X-\mu_X)^T+(\mu_X-\mu)(\mu_X-\mu)^T \tag{2}.$$
Hence, by fixing $\mu_X \in \mathbb R^n$ to any value for which $$\Sigma- (\mu_X-\mu)(\mu_X-\mu)^T \succeq 0,$$ we can follow your procedure to show that the normal distribution $$\mathcal N(\mu_X,\Sigma- (\mu_X-\mu)(\mu_X-\mu)^T)$$ is a feasible KKT solution. Among all of these KKT solutions, the one with $\mu_X=\mu$ achieves the maximum entropy because $\log\det A$ is increasing in $A$ with respect to $\succeq$. Interestingly (unlike the case where the constraint $\mathbb E XX^T=\Sigma +\mu\mu^T$ is used), the optimal solution obtained with $\mathbb E (X-\mu)(X-\mu)^T=\Sigma$ is $\mathcal N(\mu,\Sigma)$ rather than $\mathcal N(0,\Sigma+\mu\mu^T)$. If the expectation constraint is included, then in both cases the only feasible KKT point is $\mathcal N(\mu,\Sigma)$, which is also optimal. The concluding statement is therefore: the expectation constraint is what makes the KKT point unique. Without it, the KKT conditions admit a whole family of feasible solutions, and it is only because the centered constraint $\mathbb E (X-\mu)(X-\mu)^T=\Sigma$ already encodes $\mu$ that setting $\Gamma=0$ still lands on the optimal density $\mathcal N(\mu,\Sigma)$.