Optimization of a function over probability distributions


I'm trying to solve certain optimization problems dealing with probability distributions. Consider the space of probability distributions on $\{ 1, ..., N\}$, i.e. functions $p : \{ 1, ..., N\} \to [0, 1]$ with $\sum_w p(w) = 1$.

I have a function $f : (\{ 1, ..., N\} \to [0, 1]) \to \mathbb{R}$ whose minima and maxima I'd like to find.

My idea is to parameterize this space over some real-valued space like $\mathbb{R}^n$, which would let me find critical points using standard gradient methods. I was able to do this for the $N=2$ case by parametrizing $[0, 1]$ over the real line to get the probability of $1$; the probability of $2$ then follows from the probability of $1$. But I couldn't figure this out even for $N=3$.

It's also fairly easy to parameterize the space of functions $\{ 1, ..., N\} \to \mathbb{R}$ that sum to $1$ by treating them as vectors and parameterizing the affine subspace satisfying the linear equation $\sum_{w\in\{1,...,N\}} p(w) = 1$:

$$p(1) = 1 - \sum_{w\in\{2,...,N\}} p(w)$$ $$p(2) = a_1$$ $$\vdots$$ $$p(N) = a_{N-1}$$

where $a_i$ is free. Incorporating the bounds is harder, however. I manually worked out the bounded space for $N=3$, and it winds up being the right triangle with both legs of length $1$ and hypotenuse $\sqrt{2}$. I also worked it out for the $N=4$ case, where it winds up being a tetrahedron with 3 faces that are right triangles (just like the one before) and 1 face that is equilateral. I have two problems here: 1) I don't know how to parameterize the tetrahedron (I have not tried parametrizing the triangle either, but I assume it isn't hard), and 2) I don't know how these shapes generalize to higher dimensions, so I'm stuck at the $N=4$ case.

Specifically, I am concerned with finding local minima and maxima of the following function (cross entropy): $$p, q \mapsto \sum_{w\in\{1,...,N\}} p(w) \cdot \log_2\frac{1}{q(w)}$$
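For concreteness, here is a minimal sketch of this objective in Python (the name `cross_entropy` is mine, and I use the usual convention that terms with $p(w) = 0$ contribute nothing):

```python
import math

def cross_entropy(p, q):
    """Cross entropy H(p, q) = sum_w p(w) * log2(1 / q(w)),
    with p and q given as lists of probabilities over {1, ..., N}."""
    # skip zero-probability terms: 0 * log2(1/q) is taken to be 0
    return sum(pw * math.log2(1.0 / qw) for pw, qw in zip(p, q) if pw > 0)
```

For example, `cross_entropy([0.5, 0.5], [0.5, 0.5])` is the entropy of a fair coin, $1$ bit.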

For the $N=2$ case (the only non-trivial case I have solved), it turns out this is maximized at $p(1) = \frac{1}{e}$ and $q(1) = \frac{1}{1+\frac{1}{e-1}}$, which is kind of neat because it doesn't depend on the base of the logarithm.

Does anyone know how to parameterize these probability distributions? Is there a better way to go about finding minimums and maximums here?

note: I've tagged convex analysis/optimization because the set of probability distributions is convex.


There are 2 best solutions below

On BEST ANSWER

With a bit of help from a number of people, I seem to have finally figured this one out.

First, note that we can map the sub-additive subspace of $\prod^N [0,1]$ (vectors whose coordinates sum to at most $1$) onto the additive subspace of $\prod^{N+1} [0,1]$ (vectors whose coordinates sum to exactly $1$) by adding another dimension equal to $1$ minus the sum of the other dimensions. I believe this is a special case of barycentric coordinates.

Next, take a look at this mapping, where $V_{N+1}$ is assumed to be $0$ for the sake of clarity:

$$(V_1, ..., V_N)^T \mapsto (\prod^N_{i=1} V_i - \prod^{N + 1}_{i=1} V_i, ..., V_1 - V_1 \cdot V_2)^T$$

One can check that this maps $\prod^{N} [0,1]$ into the sub-additive subspace of $\prod^{N} [0,1]$: the coordinates telescope, so they sum to $V_1 \leq 1$.

Composing the two, we get this nice mapping (still using the fact that $V_{N+1}$ is zero, and additionally using the convention that an empty product is $1$):

$$(V_1, ..., V_N)^T \mapsto (...,\prod^{N-n}_{i=1} V_i - \prod^{N - n + 1}_{i=1} V_i,...)^T$$ where $n$ goes from zero to N inclusive.

It can again be checked that this maps $\prod^{N} [0,1]$ to the additive subspace of $\prod^{N+1} [0,1]$
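Here is a sketch of this telescoping map in Python (the name `cube_to_simplex` is mine), which makes the check easy to run numerically:

```python
def cube_to_simplex(v):
    """Map (V_1, ..., V_N) in [0,1]^N to an (N+1)-point probability
    distribution via the telescoping components
        p_n = prod_{i=1}^{N-n} V_i  -  prod_{i=1}^{N-n+1} V_i,   n = 0, ..., N,
    with the conventions that the empty product is 1 and V_{N+1} = 0."""
    n_dim = len(v)
    # prefix[k] = V_1 * ... * V_k, with prefix[0] = 1 (empty product)
    prefix = [1.0]
    for x in v:
        prefix.append(prefix[-1] * x)
    ext = prefix + [0.0]  # product up to V_{N+1} is 0
    return [ext[n_dim - n] - ext[n_dim - n + 1] for n in range(n_dim + 1)]
```

The output sum telescopes to $1 - 0 = 1$, and each component $P_k(1 - V_{k+1})$ is nonnegative, so the image really does lie in the additive subspace. For example, `cube_to_simplex([0.5, 0.5])` gives `[0.25, 0.25, 0.5]`.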

This seems like a nice (time will tell) analytic surjection from cubes onto probability distributions. Additionally, since we can map Euclidean space surjectively onto the unit cube, we can compose these maps to get a mapping from $N$-dimensional Euclidean space onto the $(N+1)$-point probability distributions. This lets us take the gradient and solve the system of equations, which turns out to be remarkably solvable. I've also had good luck using this kind of method for some other basic stuff.
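To make the composition concrete, here is a minimal pure-Python sketch (all names are mine, and I use a crude central-difference gradient rather than an analytic one): a sigmoid maps $\mathbb{R}^N$ into the cube, the telescoping map sends the cube onto the simplex, and plain gradient descent runs on the unconstrained parameters. As a sanity check, it minimizes the cross entropy $H(p,q)$ over $q$ for a fixed $p$; the minimizer should be $q = p$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cube_to_simplex(v):
    # telescoping-product map from [0,1]^N onto the (N+1)-point simplex
    prefix = [1.0]
    for x in v:
        prefix.append(prefix[-1] * x)
    ext = prefix + [0.0]  # V_{N+1} = 0
    n = len(v)
    return [ext[n - k] - ext[n - k + 1] for k in range(n + 1)]

def params_to_dist(theta):
    # R^N -> [0,1]^N -> (N+1)-point distribution
    return cube_to_simplex([sigmoid(t) for t in theta])

def cross_entropy(p, q, eps=1e-12):
    return sum(pw * math.log2(1.0 / max(qw, eps)) for pw, qw in zip(p, q) if pw > 0)

def grad_descent(objective, theta, steps=500, lr=0.5, h=1e-6):
    # plain gradient descent with a central-difference gradient
    theta = list(theta)
    for _ in range(steps):
        g = []
        for i in range(len(theta)):
            tp, tm = theta[:], theta[:]
            tp[i] += h
            tm[i] -= h
            g.append((objective(tp) - objective(tm)) / (2 * h))
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

# fix p and minimize H(p, q) over q's parameters; since
# cross entropy >= entropy with equality iff q = p, we expect q -> p
p = [0.2, 0.3, 0.5]
theta = grad_descent(lambda th: cross_entropy(p, params_to_dist(th)), [0.0, 0.0])
q = params_to_dist(theta)
```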


I believe that using a simple vector in $\mathbb{R}^n$ will suffice. We will represent your probability distribution $p(k)$ as a vector $p = (p_1, p_2, \dots, p_n)^T$, where $p_i$ is the probability of $i$.

Now, one of your optimization problems (maximizing the cross entropy) can be written like this: $$ \begin{aligned} \min_{p,q \in \mathbb{R}^n} &\quad \sum_{i=1}^n p_i \ln(q_i) \\ \text{s.t.} &\quad \sum_{i=1}^n p_i = 1 \\ &\quad \sum_{i=1}^n q_i = 1 \\ &\quad p_i \geq 0 \quad \forall i \in \{1, ..., n\} \\ &\quad q_i \geq 0 \quad \forall i \in \{1, ..., n\} \end{aligned} $$

The constraints make sure the vectors $p$ and $q$ indeed represent probability distributions. You cannot use "standard gradient methods" here, since you cannot easily get rid of the constraints. However, you can use other similar methods (e.g. projected gradient) to try and find critical points. Note that the set of constraints is convex, and you can even use existing algorithms to project onto your constraints. Search the web for "projecting onto the simplex".
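As one concrete option, here is a sketch of the standard sort-based Euclidean projection onto the simplex, the subroutine a projected-gradient method would call each iteration (the name `project_to_simplex` is mine):

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_i >= 0, sum_i x_i = 1}
    using the standard sort-and-threshold rule."""
    u = sorted(v, reverse=True)          # coordinates in decreasing order
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:                   # the condition holds for a prefix of indices;
            theta = t                    # keep the threshold from the last valid one
    return [max(x - theta, 0.0) for x in v]
```

For instance, `project_to_simplex([2.0, 0.0])` returns `[1.0, 0.0]`: the closest point on the simplex clips the excess mass.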

Also, note that I used the natural logarithm here, since it differs from $\log_2$ by a constant factor, and a constant factor does not affect the optimal $p,q$. I also converted $\ln(1/q_i)$ into $-\ln(q_i)$; the minus sign went away by transforming the maximization into a minimization problem.