Maximizing the weighted sum of log likelihood function?


Suppose $\mathbf{X}$ is a random binary vector over the sample space $\Omega$ with independent components, $\mathbf{X} \sim Bernoulli(\bar{\mathbf{p}})$ (i.e., $X_i \sim Bernoulli(\bar p_i)$ independently). As a matter of notation, suppose $f(\cdot;\bar{\mathbf{p}})$ is the joint probability mass function (pmf) of $\mathbf{X}$, and $f(X_i;p_i)$ is the marginal pmf of the $i^{th}$ random variable, for $i=1,2,\dots,N$. By definition, such a joint pmf is continuous and twice differentiable with respect to all its parameters.

Further, suppose $\Xi \subset \Omega$ (a proper subset). Thus,

$$ \sum_{\mathbf{X} \in \Xi} f(\mathbf{X};\bar{\mathbf{p}}) < 1 \quad \& \quad f(\mathbf{X};\bar{\mathbf{p}}) >0, \quad \forall \mathbf{X} \in \Xi\tag{0}\label{eq0}. $$

I would like to maximize the following function:

$$ \max_{\mathbf{p}}\sum_{\mathbf{X} \in \Xi}f(\mathbf{X};\bar{\mathbf{p}}) \ln f(\mathbf{X};\mathbf{p}) \tag{1}\label{eq1}, $$

where $\bar{\mathbf{p}}$ is the known parameter vector of $f(\cdot;\bar{\mathbf{p}})$, but $\mathbf{p}$ is unknown and needs to be optimized.

My whole purpose in doing so is to show that

$$ \sum_{\mathbf{X} \in \Xi} f(\mathbf{X};\mathbf{p}^*) \geq \sum_{\mathbf{X} \in \Xi} f(\mathbf{X};\bar{\mathbf{p}})\tag{2}\label{eq2}, $$

where, $\mathbf{p}^*$ is the maximizer of Eq \eqref{eq1}.

This is what I tried:

The first order condition for $p_i\in \mathbf{p}$ ($p_i$ is the $i^{th}$ element of the vector $\mathbf{p}$):

$$[p_i]: \sum_{\mathbf{X} \in \Xi} \frac{ f(\mathbf{X};\bar{\mathbf{p}})}{ f(\mathbf{X};\mathbf{p})} f'(\mathbf{X};p_i) = \dots = \sum_{\mathbf{X} \in \Xi} \frac{ f(\mathbf{X};\bar{\mathbf{p}})}{ f(\mathbf{X};\mathbf{p})} f'(\mathbf{X};p_j) = 0, \quad \forall p_i,p_j \in \mathbf{p} \tag{2a}\label{eqfoc}.$$

Because of the independence assumption made at the outset, $f'(\mathbf{X};p_i) = f'(X_i;p_i) \frac{ f(\mathbf{X};\mathbf{p}) }{f(X_i;p_i)}$, and it follows that:

$$[p_i]: \sum_{\mathbf{X} \in \Xi} \frac{ f(\mathbf{X};\bar{\mathbf{p}})}{ f(X_i;p_i)} f'(X_i;p_i) = \dots = \sum_{\mathbf{X} \in \Xi} \frac{ f(\mathbf{X};\bar{\mathbf{p}})}{ f(X_j;p_j)} f'(X_j;p_j) = 0, \quad \forall p_i,p_j \in \mathbf{p}:i \neq j. \tag{2b}\label{eqfocb}$$

which implies that,

$$ \sum_{\mathbf{X} \in \Xi} f'(X_i;p_i) = 0, \quad \forall p_i \in \mathbf{p} \tag{3}\label{eq3},$$

and

$$\frac{ f(\mathbf{X};\bar{\mathbf{p}})}{ f(\mathbf{X};\mathbf{p})} = \lambda (\text{some constant}), \quad \forall \mathbf{X} \in \Xi \tag{4}\label{eq4}.$$

Note that Eq.\eqref{eq3} holds because of the following:

  1. $\sum_{\mathbf{X} \in \Omega} f(\mathbf{X};\mathbf{p}) = 1 \implies \sum_{\mathbf{X} \in \Omega} f'(\mathbf{X};p_i) = 0, \forall p_i \in \mathbf{p}$;
  2. We assumed $\Xi \subset \Omega$, over which $f(\mathbf{X};\bar{\mathbf{p}})>0$ by Eq\eqref{eq0}.
  3. $\frac{ f(\mathbf{X};\bar{\mathbf{p}})}{ f(X_i;p_i)}>0$ by definition; moreover, since we have a maximization problem and the freedom to set each $p_i \in \mathbf{p}$, the condition $f(X_i;p_i)>0, \forall X_i \in \mathbf{X} \in \Xi$ is not a constraint but an implication of the maximization.
  4. Given $f(X_i;p_i)>0, \forall X_i \in \mathbf{X} \in \Xi$, we have $\sum_{X_i \in \{0,1\}} f(X_i;p_i) = 1$, which implies Eq\eqref{eq3}; otherwise, it would mean that $\Xi \not\subset \Omega$.

In turn, this implies Eq\eqref{eq2}, because Eq\eqref{eq3} is actually the first order condition for

$$\max_p \sum_{\mathbf{X} \in \Xi} f(\mathbf{X};\mathbf{p}) = \sum_{\mathbf{X} \in \Xi}f(\mathbf{X};\mathbf{p}^*) \geq \sum_{\mathbf{X} \in \Xi} f(\mathbf{X};\bar{\mathbf{p}}). $$
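As a sanity check on the claim in Eq \eqref{eq2}, here is a minimal numerical sketch in Python: it maximizes Eq \eqref{eq1} by grid search for a small example ($N=2$) and checks that the maximizer puts at least as much mass on $\Xi$ as $\bar{\mathbf{p}}$ does. The values of $\bar{\mathbf{p}}$ (written `q`) and $\Xi$ are hypothetical illustrative choices, not from the question.

```python
# Minimal numerical sketch of Eq (2) for N = 2. The values of q and Xi
# below are hypothetical illustrative choices.
import math
from itertools import product

q = (0.3, 0.7)            # known parameter vector (the question's p-bar)
Xi = [(0, 0), (1, 1)]     # a proper subset of the sample space {0,1}^2

def f(x, p):
    """Joint pmf of a vector of independent Bernoulli(p_i) variables."""
    out = 1.0
    for xi, pi in zip(x, p):
        out *= pi if xi == 1 else (1.0 - pi)
    return out

def objective(p):
    """Eq (1): the f(x;q)-weighted sum of log-likelihoods over Xi."""
    return sum(f(x, q) * math.log(f(x, p)) for x in Xi)

# Grid search for the maximizer p* of Eq (1) over (0,1)^2.
grid = [i / 200 for i in range(1, 200)]
p_star = max(product(grid, repeat=2), key=objective)

mass_star = sum(f(x, p_star) for x in Xi)   # left-hand side of Eq (2)
mass_q = sum(f(x, q) for x in Xi)           # right-hand side of Eq (2)
print(p_star, mass_star, mass_q)
assert mass_star >= mass_q                  # Eq (2) holds in this example
```

In this example the grid maximizer comes out as $p^* = (0.5, 0.5)$, which places mass $0.5$ on $\Xi$ versus $0.42$ under $\bar{\mathbf{p}}$, consistent with Eq \eqref{eq2}.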

My questions:

  • Is my intuition correct? Is there any mistake?
  • I am a bit suspicious about Eq\eqref{eq3} and Eq\eqref{eq4}. I suspect additional conditions are needed to get the latter. Could you help me identify those conditions, if any are needed?
  • Do you think it is possible to have Eq\eqref{eqfoc} hold but not Eq\eqref{eq3} and Eq\eqref{eq4}?
  • What I derived is very similar to this post. Thus, I believe the intuition is correct, though some constraints on the pmfs are missing.
On BEST ANSWER

Note that we can write the independent Bernoulli law as $f(x;p) = \prod_i p_i^{x_i}(1-p_i)^{1-x_i},$ where $x_i \in \{0,1\}$. I'll write $q$ instead of $\bar p$ to avoid typing \bar repeatedly (and I'll also forgo all the bolding in the question).

Let me denote $$ J(p;q,\Xi) := \sum_{x \in \Xi} f(x;q) \log f(x;p).$$ We're interested in the program $$ J(q;\Xi) := \max_{p \in [0,1]^n} J(p;q,\Xi). $$

There are two slightly different ways to approach this. First, if you want to take derivatives directly, note that you would need to introduce Lagrange multipliers to handle the constraints on the entries of $p$, and use complementary slackness to sort out the boundary cases. More or less, this devolves into the following condition for each $i$. Consider the function $$ U(p_i;q,\Xi) := \sum_{x \in \Xi} \frac{ f(x;q)}{ p_i^{x_i} (1-p_i)^{1-x_i}} (2x_i - 1). $$ If $U(p_i;q,\Xi)$ has a root in $(0,1),$ then the optimal value of $p_i$ is that root. If it does not, then the optimal value is either $0$ or $1$, depending on whether $U \le 0$ or $U \ge 0$ on $[0,1]$. The form of $U$ comes from exploiting the independence structure embedded in $f$.
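To make this first approach concrete, here is a small Python sketch (again with hypothetical values of $q$ and $\Xi$) that solves $U(p_i;q,\Xi) = 0$ coordinate by coordinate via bisection; since $U$ is strictly decreasing in $p_i$, an interior root, when it exists, is unique, and otherwise the optimum sits at the boundary as described above.

```python
# Coordinate-wise solution of U(p_i; q, Xi) = 0 by bisection.
# q and Xi are hypothetical illustrative values.
from itertools import product

q = (0.3, 0.7)
Xi = [(0, 0), (0, 1), (1, 1)]

def f(x, p):
    """Joint pmf of independent Bernoulli(p_j) variables."""
    out = 1.0
    for xj, pj in zip(x, p):
        out *= pj if xj == 1 else (1.0 - pj)
    return out

def U(pi, i):
    """U(p_i; q, Xi) from the answer; strictly decreasing in p_i."""
    return sum(f(x, q) * (2 * x[i] - 1) / (pi if x[i] == 1 else (1.0 - pi))
               for x in Xi)

def solve(i, lo=1e-9, hi=1 - 1e-9, iters=100):
    """Bisection: U is positive near 0 and negative near 1 iff an interior root exists."""
    if U(lo, i) <= 0:   # no interior root: boundary optimum p_i = 0
        return 0.0
    if U(hi, i) >= 0:   # no interior root: boundary optimum p_i = 1
        return 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if U(mid, i) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p_star = tuple(solve(i) for i in range(2))
print(p_star)
```

For these example values the bisection recovers $p_1 = 0.21/0.91$ and $p_2 = 0.70/0.91$, the conditional means of $X_1$ and $X_2$ given $X \in \Xi$ under $q$.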


But, we can exploit this independence structure from the get go. Notice that $$ J(p;q,\Xi) = \sum_{x \in \Xi} f(x;q) \sum_i \varphi(x_i; p_i) = \sum_i J_i(p_i;q,\Xi),$$ where \begin{align} \varphi(x_i;p_i) &= x_i \log p_i + (1-x_i) \log (1-p_i)\\ J_i(p_i;q,\Xi) &= \sum_{x \in \Xi} f(x;q) \varphi(x_i;p_i). \end{align}

An immediate observation is that the problem is completely decoupled over the $p_i$s, so we might as well understand just a single such problem. Let me further denote $F(\Xi;q) = \sum_{z \in \Xi} f(z;q),$ and the conditional law $f_\Xi(x;q) = f(x;q) \mathbf{1}\{x \in \Xi\}/F(\Xi;q)$. Then, multiplying and dividing by $F(\Xi;q),$ we have $$ \frac{J_i(p_i;q,\Xi)}{F(\Xi;q)} = \sum_{x} f_\Xi(x;q) (x_i \log p_i + (1-x_i) \log (1-p_i)) \\ = \mu_i(q;\Xi) \log p_i + (1-\mu_i(q;\Xi)) \log(1-p_i), $$ where $$ \mu_i(q;\Xi) := \mathbb{E}_{X \sim q}[X_i|X \in \Xi].$$ Obviously $\mu_i \in [0,1]$. But note that this is just a (negative) cross entropy between binary laws with parameters $\mu_i(q;\Xi)$ and $p_i,$ so the maximiser is precisely $p_i = \mu_i(q;\Xi),$ and the optimal value is $$ -F(\Xi;q) \cdot \sum_i h_2(\mu_i(q;\Xi)),$$ where $h_2$ is the binary entropy function $h_2(z) = -z \log z - (1-z) \log(1-z)$.
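The closed form can be checked numerically. The sketch below (again with hypothetical $q$ and $\Xi$) computes $\mu_i(q;\Xi)$, confirms that $J$ evaluated at $p = \mu$ equals $-F(\Xi;q)\sum_i h_2(\mu_i(q;\Xi))$, and that it dominates a grid of alternative choices of $p$.

```python
# Numerical check of the closed form p_i* = mu_i(q; Xi) and the optimal value
# -F(Xi; q) * sum_i h2(mu_i). q and Xi are hypothetical illustrative values.
import math
from itertools import product

q = (0.3, 0.7)
Xi = [(0, 0), (0, 1), (1, 1)]

def f(x, p):
    """Joint pmf of independent Bernoulli(p_j) variables."""
    out = 1.0
    for xj, pj in zip(x, p):
        out *= pj if xj == 1 else (1.0 - pj)
    return out

F = sum(f(x, q) for x in Xi)                                   # F(Xi; q)
mu = [sum(f(x, q) * x[i] for x in Xi) / F for i in range(2)]   # conditional means

def J(p):
    """Objective J(p; q, Xi) = sum_{x in Xi} f(x;q) log f(x;p)."""
    return sum(f(x, q) * math.log(f(x, p)) for x in Xi)

def h2(z):
    """Binary entropy (natural log)."""
    return -z * math.log(z) - (1 - z) * math.log(1 - z)

opt_closed = -F * sum(h2(m) for m in mu)
print(mu, J(mu), opt_closed)

# J at the claimed maximizer matches the closed-form value and beats a grid.
grid = [i / 100 for i in range(1, 100)]
assert abs(J(mu) - opt_closed) < 1e-9
assert all(J(mu) >= J(p) for p in product(grid, repeat=2))
```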