Seeming arbitrariness of the Maximum Entropy Distribution


I have a two-parameter model $C = (C_1, C_2) \in \mathbb{R}^2$ and would like to look at the choice of parameters from a stochastic point of view. A minimal set of constraints on $C$ is that

  1. The mean $\underline{C} = (\underline{C_1},\underline{C_2})$ is known, which can be written as $\mathbb{E}(C) = \underline{C}$;
  2. $\mathbb{E}\{g(C)\} = \alpha$, where $\alpha$ is finite and $g(C) = \log(C_1 C_2^{-3})$ (the choice of $\alpha$ is up to me; the point is merely that it is finite, which matters given the $\log$);
  3. The normalisation constraint, which can be written as $\mathbb{E}\{1\} = 1$.

All three constraints can be written as mathematical expectations, namely $\mathbb{E}\{F(C)\} = H$, where $F\,\colon\, \mathbb{R}^2 \to \mathbb{R}^4$ with $F(C) = (C,g(C),1)$ and $H = (\underline{C},\alpha,1) \in \mathbb{R}^4$.

In my naive understanding of the Maximum Entropy Principle, as discussed in the classical paper "Information theory and statistical mechanics" by Jaynes, it seems that in this case the maximum entropy p.d.f. of the random variable $C$ is given by $$ \rho_C(c) = \mathbb{1}_{\mathcal{S}}(c) \exp\{-\langle \lambda,\,F(c) \rangle_{\mathbb{R}^4}\} = \mathbb{1}_{\mathcal{S}}(c)\,k_4\,\exp\{-\lambda_1 c_1\} \exp\{-\lambda_2 c_2\} c_1^{-\lambda_3}c_2^{3\lambda_3} $$ where $\lambda = (\lambda_1,\dots,\lambda_4)$ is the set of Lagrange multipliers corresponding to the constraints and $\mathcal{S}$ is the largest set in $\mathbb{R}^2$ for which the constraints are satisfied. Clearly, from constraint 2 we have $C_1 > 0$ and $C_2 > 0$, so $\mathcal{S} = \mathbb{R}_+^2$. Note that $k_4 = \exp\{-\lambda_4\}$ is the normalisation constant.

One can see that in fact $$ \rho_C(c) = \rho_{C_1}(c_1) \times \rho_{C_2}(c_2) $$ and both $C_1$ and $C_2$ follow a Gamma distribution, with hyperparameters (shape and scale): $$ (\alpha_1,\beta_1) = (1-\lambda_3,\underline{c_1}/(1-\lambda_3)) \quad\text{ and }\quad (\alpha_2,\beta_2) = (1+3\lambda_3,\underline{c_2}/(1+3\lambda_3)). $$

This is where my confusion starts. The parameter $\lambda_3$ is in a sense free, since, as mentioned, $\alpha$ in constraint 2 can vary. However, $\lambda_3$ is shared by both random variables, and for both of them to be Gamma distributed it needs to satisfy $\lambda_3 \in (-1/3,1)$, as otherwise one of the densities is not integrable near zero. But this severely limits our control of "statistical fluctuations", since a large negative $\lambda_3$ would mean the density $\rho_{C_1}$ is strongly concentrated near the mean $\underline{c_1}$. There seem to be two ways of working with this.
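The restriction on $\lambda_3$ in this setup can be checked with a short sketch. The function names and the example means below are hypothetical, chosen only for illustration; the shape formulas are the ones stated above, the scale is the mean divided by the shape, and the relative spread uses the standard Gamma fact that std/mean equals $1/\sqrt{\text{shape}}$.

```python
import math


def gamma_hyperparameters(mean_c1, mean_c2, lam3):
    """Return ((shape1, scale1), (shape2, scale2)) for C_1 and C_2.

    Hypothetical helper for the original setup:
    shape1 = 1 - lam3, shape2 = 1 + 3*lam3, scale = mean / shape.
    """
    shape1 = 1.0 - lam3
    shape2 = 1.0 + 3.0 * lam3
    if shape1 <= 0.0 or shape2 <= 0.0:
        # Outside (-1/3, 1) one of the densities is not integrable near zero.
        raise ValueError("lam3 must lie in (-1/3, 1)")
    return (shape1, mean_c1 / shape1), (shape2, mean_c2 / shape2)


def coefficient_of_variation(shape):
    # For a Gamma distribution, std / mean = 1 / sqrt(shape).
    return 1.0 / math.sqrt(shape)


# Example means are made up for illustration.
(a1, b1), (a2, b2) = gamma_hyperparameters(2.0, 5.0, lam3=-0.3)
print("C_1:", a1, b1, coefficient_of_variation(a1))
print("C_2:", a2, b2, coefficient_of_variation(a2))
```

Within the admissible interval the shape of $C_1$ stays below $4/3$ and that of $C_2$ below $4$, so the relative spread cannot drop below roughly $0.87$ for $C_1$ and $0.5$ for $C_2$, which is one way to quantify the limited control over fluctuations.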

(1) We persist with the current setup; if, for example, it is necessary to set $\lambda_3 = -100$, then $C_2$ is presumably not Gamma distributed, and we have to exclude the origin and say that the support of $\rho_{C_2}$ is $(\epsilon,+\infty)$. But how does one choose $\epsilon$ here? It seems that any $\epsilon > 0$ would do, which seems a bit nonsensical given the broader context.

(2) We can also resort to a change of variables to $B = (B_1,B_2) \in \mathbb{R}^2$ with $B_1 = C_1$ but $B_2 = C_2^{-1}$. Clearly there is a one-to-one correspondence between $\underline{C} \in \mathbb{R}_+^2$ and $\underline{B} \in \mathbb{R}_+^2$, and in the second constraint $g$ is replaced by $\tilde{g}(B) = \log(B_1B_2^3)$. If we use the same framework as above, the conclusion appears to be that $B_1$ and $B_2$ are both Gamma-distributed, this time with hyperparameters $$ (\alpha_1,\beta_1) = (1-\lambda_3,\underline{b_1}/(1-\lambda_3)) \quad\text{ and }\quad (\alpha_2,\beta_2) = (1-3\lambda_3,\underline{b_2}/(1-3\lambda_3)), $$ so now $\lambda_3 \in (-\infty,1/3)$ and we have the freedom to set $\lambda_3$ as negative as we like, so we can control the 'scale' of the relevant fluctuations around the mean.
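To see what this freedom buys in the alternative setup, here is a minimal sketch under the same Gamma assumptions. The function name and the example means are hypothetical; the shapes are $1-\lambda_3$ and $1-3\lambda_3$, and each scale is taken as the mean divided by the shape so that the mean constraint holds.

```python
import math


def gamma_hyperparameters_b(mean_b1, mean_b2, lam3):
    """Return ((shape1, scale1), (shape2, scale2)) for B_1 and B_2.

    Hypothetical helper for the change of variables B_2 = 1 / C_2:
    shape1 = 1 - lam3, shape2 = 1 - 3*lam3, scale = mean / shape.
    """
    shape1 = 1.0 - lam3
    shape2 = 1.0 - 3.0 * lam3
    if shape1 <= 0.0 or shape2 <= 0.0:
        # Both shapes are positive exactly when lam3 < 1/3.
        raise ValueError("lam3 must be smaller than 1/3")
    return (shape1, mean_b1 / shape1), (shape2, mean_b2 / shape2)


# Example means are made up; lam3 = -100 is the value discussed in option (1).
(a1, b1), (a2, b2) = gamma_hyperparameters_b(2.0, 0.2, lam3=-100.0)
for name, shape in (("B_1", a1), ("B_2", a2)):
    # std / mean = 1 / sqrt(shape) for a Gamma distribution.
    print(name, "shape:", shape, "relative spread:", 1.0 / math.sqrt(shape))
```

With $\lambda_3 = -100$ both shapes are large (101 and 301), so both densities are tightly concentrated around their means, which is not achievable in the original parametrisation.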

What strikes me here is the seeming arbitrariness of the maximum entropy principle in this case: in the alternative approach (2) we arrive at the conclusion that $B_2$ is Gamma distributed, which would mean that $C_2 = B_2^{-1}$ follows an 'Inverse Gamma distribution'. In the original setup, on the other hand, $C_2$ follows a Gamma distribution. So which is it?
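For reference, the change of variables behind the 'Inverse Gamma' statement can be written out explicitly. Assuming $B_2$ is Gamma distributed with shape $\alpha_2$ and scale $\beta_2$, and using $b = 1/c$ with $|\mathrm{d}b/\mathrm{d}c| = c^{-2}$: $$ \rho_{C_2}(c) = \rho_{B_2}(1/c)\,c^{-2} = \frac{1}{\Gamma(\alpha_2)\,\beta_2^{\alpha_2}}\; c^{-(\alpha_2-1)}\, \exp\{-1/(\beta_2 c)\}\; c^{-2} = \frac{1}{\Gamma(\alpha_2)\,\beta_2^{\alpha_2}}\; c^{-\alpha_2-1} \exp\{-1/(\beta_2 c)\}, \qquad c > 0, $$ which is the Inverse Gamma density with shape $\alpha_2$ and scale $1/\beta_2$. Its functional form, a power of $c$ times $\exp\{-\text{const}/c\}$, differs from the form $c_2^{3\lambda_3}\exp\{-\lambda_2 c_2\}$ obtained for $C_2$ in the original setup.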

I suppose it all relies on the fact that, in both settings, $\lambda_3$ is not really constrained to lie in an interval; rather, if $\lambda_3$ is within the intervals stated, then the underlying random variables are indeed Gamma-distributed. If that is the case, then I suppose the alternative approach (2) is the way to go, as it gives a nicer picture at the end. I would appreciate any comments people might have. Thank you for going through this wall of text :)