How to compute the optimal values of $\lambda_1,\ldots,\lambda_K$?


So, here's a question that has come up in my research work. Suppose $P_1$ and $P_2$ are $M\times M$ transition probability matrices such that $P_1\neq P_2$. Further, let $\mu_1$ and $\mu_2$ denote the unique stationary distributions associated with $P_1$ and $P_2$ respectively. For any integer $d\geq 1$, we shall denote by $P_1^d$ the $d$-th power of $P_1$, i.e., the $d$-step transition probability matrix of the chain with transition matrix $P_1$. The quantity $P_2^d$ is defined similarly.

We shall now define the Kullback Leibler divergence (KL divergence) between the matrices $P_1^d$ and $P_2^d$ weighted by $\mu_1$ as \begin{equation} D(P_1^d||P_2^d|\mu_1):= \sum\limits_{i=1}^{M}\mu_1(i)\sum\limits_{j=1}^{M}P_1^d(i,j)\log\frac{P_1^d(i,j)}{P_2^d(i,j)}, \end{equation} where $P_1^d(i,j)$ denotes the $(i,j)$th entry of the transition matrix $P_1^d$; the quantity $P_2^d(i,j)$ is defined similarly.
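For concreteness, this weighted divergence is straightforward to compute numerically. The sketch below uses NumPy; the two $2\times 2$ matrices are purely illustrative placeholders, not from the actual problem:

```python
import numpy as np

def stationary(P):
    # Left eigenvector of P for eigenvalue 1, normalised to sum to 1.
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def kl_weighted(Pa, Pb, mu):
    # D(Pa || Pb | mu) = sum_i mu(i) sum_j Pa(i,j) log(Pa(i,j)/Pb(i,j))
    return float(np.sum(mu[:, None] * Pa * np.log(Pa / Pb)))

# Hypothetical chains, purely for illustration (M = 2).
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = np.array([[0.6, 0.4], [0.5, 0.5]])
mu1 = stationary(P1)

d = 3
D = kl_weighted(np.linalg.matrix_power(P1, d),
                np.linalg.matrix_power(P2, d), mu1)
print(D)
```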

Given a $K$-length probability vector $\lambda=(\lambda_1,\ldots,\lambda_K)$ where $\lambda_i\geq 0$ and $\sum\limits_{i}\lambda_i=1$, we shall define a function $f(\lambda)$ as follows: \begin{equation} f(\lambda):=\sum\limits_{d=1}^{\infty}\bigg[\underbrace{\lambda_1(1-\lambda_1)^{d-1}}_{\text{Geometric distb. with parameter }\lambda_1}\,D(P_1^d||P_2^d|\mu_1)\,+\,\underbrace{\sum\limits_{k=2}^{K}\frac{1}{K-1}\lambda_k(1-\lambda_k)^{d-1}}_{\text{mixture of Geometric distributions}}\,D(P_2^d||P_1^d|\mu_2)\bigg]. \end{equation}

I would like to determine the value of the probability vector $\lambda=(\lambda_1,\ldots,\lambda_K)$ that maximises $f(\lambda)$. As I have pointed out in the above equation, one of the terms represents a Geometric distribution with parameter $\lambda_1$, whereas the other represents a weighted sum of Geometric distributions, where the weight is $1/(K-1)$ for each of the Geometric distributions $\text{Geo}(\lambda_2),\ldots,\text{Geo}(\lambda_K)$.

I know $P_1$ and $P_2$, and thus I can compute the quantities $D(P_1^d||P_2^d|\mu_1)$ and $D(P_2^d||P_1^d|\mu_2)$ at least in principle. In the above, $M\geq 2$ and $K\geq 3$.
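Since the Geometric weights decay exponentially in $d$, $f(\lambda)$ can be approximated well by truncating the series at some $d_{\max}$ for interior $\lambda$. A sketch of such an evaluation (the chains and the choice $d_{\max}=200$ are illustrative assumptions):

```python
import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def kl_weighted(Pa, Pb, mu):
    return float(np.sum(mu[:, None] * Pa * np.log(Pa / Pb)))

def f(lam, P1, P2, d_max=200):
    # Truncation of the series defining f(lambda); the geometric weights
    # make the tail beyond d_max negligibly small for interior lambda.
    mu1, mu2 = stationary(P1), stationary(P2)
    K = len(lam)
    A, B = P1.copy(), P2.copy()  # A = P1^d, B = P2^d
    total = 0.0
    for d in range(1, d_max + 1):
        w1 = lam[0] * (1.0 - lam[0]) ** (d - 1)
        w2 = sum(lam[k] * (1.0 - lam[k]) ** (d - 1)
                 for k in range(1, K)) / (K - 1)
        total += w1 * kl_weighted(A, B, mu1) + w2 * kl_weighted(B, A, mu2)
        A, B = A @ P1, B @ P2
    return total

# Hypothetical chains and a probability vector with K = 3.
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = np.array([[0.6, 0.4], [0.5, 0.5]])
val = f(np.array([0.5, 0.3, 0.2]), P1, P2)
print(val)
```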

I am facing difficulty in finding the value of $\lambda$ that maximises $f(\lambda)$. One approach is to partially differentiate $f(\lambda)$ with respect to $\lambda_i$ for each $i$, and set this partial derivative to $0$. But I do not have any justification for passing the partial derivative inside the infinite summation over the $d$ variable.

Another approach I have tried is to apply Jensen's inequality to the function $g(x)=x(1-x)^{d-1}$ in order to collapse the mixture of Geometric distributions into a single Geometric term. However, this function is neither convex nor concave on the whole of $[0,1]$, so I could not proceed further. Had Jensen's inequality been applicable, the $K$-variable optimisation would have reduced to an optimisation over $\lambda_1\in[0,1]$ alone. Essentially, I am looking to upper bound $f(\lambda)$, but have not been able to get anywhere.

Any hints on how I can proceed will be greatly appreciated.

Accepted answer:

We will assume that the matrices $P_1,P_2$ are irreducible, aperiodic, and have strictly positive entries (so that every logarithm in the divergences is well defined).

As a consequence of the Perron-Frobenius theorem, $P_i^d$ converges to $\bf{1}\mu_i$ as $d$ tends to infinity (here $\bf{1}$ is a column vector of ones and $\mu_i$ is treated as a row vector). Therefore the divergence $D(P_1^d||P_2^d|\mu_1)$ converges as $d\to\infty$; in particular it is bounded in $d$, so $f(\lambda)$ is well defined.
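This convergence is easy to check numerically. In the sketch below, $P_1$ is a hypothetical positive $2\times 2$ chain, and every row of $P_1^{100}$ agrees with $\mu_1$ to machine precision:

```python
import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

# Hypothetical positive chain; its second eigenvalue is 0.7, so P1^d
# approaches the rank-one matrix 1*mu1 geometrically fast.
P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
mu1 = stationary(P1)
Pd = np.linalg.matrix_power(P1, 100)
err = np.max(np.abs(Pd - mu1))  # each row of P1^100 compared with mu1
print(err)
```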

Note that $f$ separates additively as $f(\lambda)=f_1(\lambda_1)+\cdots+f_K(\lambda_K)$, so (setting aside the coupling constraint $\sum_i\lambda_i=1$ for the moment) each $f_i$ can be maximised independently over its respective variable $\lambda_i$.

Any power series can be differentiated term-wise within its radius of convergence, so passing the partial derivative inside the infinite sum is justified. E.g., for $\lambda_1\in(0,1)$,

$$\frac{\partial f_1(\lambda_1)}{\partial \lambda_1}=D(P_1||P_2|\mu_1)+\sum_{d=2}^{\infty}\left[(1-\lambda_1)^{d-1}-(d-1)\lambda_1(1-\lambda_1)^{d-2}\right]D(P_1^d||P_2^d|\mu_1) $$
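One way to sanity-check this term-wise derivative is to compare a truncation of the series against a central finite difference of a matching truncation of $f_1$; the chains below are hypothetical:

```python
import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def kl_weighted(Pa, Pb, mu):
    return float(np.sum(mu[:, None] * Pa * np.log(Pa / Pb)))

def f1(l1, P1, P2, d_max=200):
    # Truncation of f_1(lambda_1) = sum_d lambda_1 (1-lambda_1)^(d-1) D_d.
    mu1 = stationary(P1)
    A, B = P1.copy(), P2.copy()
    s = 0.0
    for d in range(1, d_max + 1):
        s += l1 * (1.0 - l1) ** (d - 1) * kl_weighted(A, B, mu1)
        A, B = A @ P1, B @ P2
    return s

def df1(l1, P1, P2, d_max=200):
    # Term-wise derivative of the same truncation.
    mu1 = stationary(P1)
    A, B = P1.copy(), P2.copy()
    s = 0.0
    for d in range(1, d_max + 1):
        w = (1.0 - l1) ** (d - 1) - (d - 1) * l1 * (1.0 - l1) ** (d - 2)
        s += w * kl_weighted(A, B, mu1)
        A, B = A @ P1, B @ P2
    return s

P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = np.array([[0.6, 0.4], [0.5, 0.5]])
h = 1e-6
num = (f1(0.4 + h, P1, P2) - f1(0.4 - h, P1, P2)) / (2 * h)
print(df1(0.4, P1, P2), num)  # the two values should agree closely
```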

I can't see any means of obtaining exact solutions to $\frac{\partial f_i}{\partial \lambda_i}=0$ so I would suggest using a numerical method. For instance, you could use Newton's method:

Choose $x_0\in(0,1)$ and iterate
$$x_{n+1}=x_n-\frac{\frac{\partial f_i}{\partial \lambda_i}(x_n)}{\frac{\partial^2 f_i}{\partial \lambda_i^2}(x_n)},$$
then repeat from several more starting points $x_0$ until you are confident you have found most of the roots.
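A sketch of this multi-start Newton iteration, using a central-difference approximation of the derivative of the target function (all matrices and starting points are illustrative; note that for some chains the maximiser sits on the boundary of $[0,1]$, in which case no interior root exists and the list of roots comes back empty):

```python
import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def kl_weighted(Pa, Pb, mu):
    return float(np.sum(mu[:, None] * Pa * np.log(Pa / Pb)))

def df1(l1, P1, P2, d_max=200):
    # Term-wise derivative of the truncated series for f_1.
    mu1 = stationary(P1)
    A, B = P1.copy(), P2.copy()
    s = 0.0
    for d in range(1, d_max + 1):
        w = (1.0 - l1) ** (d - 1) - (d - 1) * l1 * (1.0 - l1) ** (d - 2)
        s += w * kl_weighted(A, B, mu1)
        A, B = A @ P1, B @ P2
    return s

def newton_roots(g, starts, h=1e-6, tol=1e-10, max_iter=50):
    # Multi-start Newton iteration x <- x - g(x)/g'(x), with g' from a
    # central difference; iterates leaving (0,1) are discarded.
    roots = []
    for x in starts:
        for _ in range(max_iter):
            gp = (g(x + h) - g(x - h)) / (2.0 * h)
            if gp == 0.0:
                break
            x_new = x - g(x) / gp
            if not (0.0 < x_new < 1.0):
                break
            if abs(x_new - x) < tol:
                roots.append(x_new)
                break
            x = x_new
    return roots

P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = np.array([[0.6, 0.4], [0.5, 0.5]])
g = lambda x: df1(x, P1, P2)
roots = newton_roots(g, np.linspace(0.1, 0.9, 9))
print(roots)  # may be empty if the maximiser sits on the boundary
```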

You mentioned that you want to find an upper bound for $f(\lambda)$. Proposition 3 in the paper below gives a bound on the rate at which a discrete Markov chain converges to its stationary distribution. It might be possible to use this proposition to bound the rate at which $D(P_1^d||P_2^d|\mu_1)$ converges to its limit $\sum_{i=1}^M\mu_1(i)\log \frac{\mu_1(i)}{\mu_2(i)}=D(\mu_1||\mu_2)$ and hence find an upper bound for $f$.
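Indeed, since each row of $P_i^d$ tends to $\mu_i$, the double sum defining $D(P_1^d||P_2^d|\mu_1)$ collapses in the limit to $\sum_{i}\mu_1(i)\log\frac{\mu_1(i)}{\mu_2(i)}=D(\mu_1||\mu_2)$. A quick numerical check with hypothetical chains:

```python
import numpy as np

def stationary(P):
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def kl_weighted(Pa, Pb, mu):
    return float(np.sum(mu[:, None] * Pa * np.log(Pa / Pb)))

P1 = np.array([[0.9, 0.1], [0.2, 0.8]])
P2 = np.array([[0.6, 0.4], [0.5, 0.5]])
mu1, mu2 = stationary(P1), stationary(P2)

# Limit of D(P1^d || P2^d | mu1): rows of P1^d -> mu1, rows of P2^d -> mu2,
# so the double sum collapses to D(mu1 || mu2).
D_limit = float(np.sum(mu1 * np.log(mu1 / mu2)))
D_100 = kl_weighted(np.linalg.matrix_power(P1, 100),
                    np.linalg.matrix_power(P2, 100), mu1)
print(D_100, D_limit)
```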

Douc, R.; Moulines, E.; Rosenthal, Jeffrey S. Quantitative bounds on convergence of time-inhomogeneous Markov chains. Ann. Appl. Probab. 14 (2004), no. 4, 1643--1665. https://projecteuclid.org/euclid.aoap/1099674073