I'm trying to understand the worked propositions in this paper (machine learning / reinforcement learning): https://arxiv.org/abs/2007.02832
The authors formulate this objective:
$\hat{g}^*=\arg\max_{\hat{g}\in B}\mathbb{E}_{g'\sim q(g'|\hat{g})}[\Delta H(g')]$
Where:
$\Delta H(g')=p_{ag}(g')\log p_{ag}(g')-(p_{ag}(g')+\eta)\log(p_{ag}(g')+\eta)$
$p_{ag}$ is a PMF and $H$ is the information theory entropy.
They find that:
$\lim_{\eta\to 0}\dfrac{\Delta H(g')}{\eta}=-1-\log p_{ag}(g')$
Which they claim is the same as $\nabla_{p_{ag}}H[p_{ag}]$
$\nabla_{p_{ag}}H[p_{ag}]=-1-\log p_{ag}(g')$
I'm having a hard time understanding this last equality. I did not know it was possible to take a gradient with respect to a probability distribution. I can't make sense of it...
If you are familiar with multivariable calculus (and in particular gradients), this should be easy.
The entropy can be written as a function (a functional) of a pmf (probability mass function), say:
$$H({\bf p}) = - \sum_x p(x) \log p(x)$$
If you have trouble digesting this, imagine the random variable takes values over four integers $\{1,2,3,4\}$. Then the pmf consists of a set of four numbers: ${\bf p} = (p_1,p_2, p_3,p_4)$. And the entropy $H({\bf p}) = H(p_1,p_2, p_3,p_4)$ will take different values for different values of this (multidimensional) variable ${\bf p}$.
Hence, it makes perfect sense to take the derivative with respect to ${\bf p}$. (Notice that, even if $x$ is discrete, the values of the pmf are continuous.)
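To make this concrete, here is a quick numerical sketch (the four-point pmf below is a made-up example): the finite-difference gradient of $H$, perturbing each coordinate independently, matches the analytic formula $-\log p_i - 1$.

```python
import numpy as np

# A toy pmf over four outcomes, as in the example above (values are arbitrary).
p = np.array([0.1, 0.2, 0.3, 0.4])

def H(p):
    """Shannon entropy (natural log) of a pmf given as a vector."""
    return -np.sum(p * np.log(p))

# Analytic (unconstrained) gradient: dH/dp_i = -log(p_i) - 1
grad_analytic = -np.log(p) - 1.0

# Finite-difference gradient: perturb each coordinate independently,
# ignoring the sum-to-one constraint, as in the unconstrained derivative.
eps = 1e-7
grad_fd = np.zeros_like(p)
for i in range(len(p)):
    dp = np.zeros_like(p)
    dp[i] = eps
    grad_fd[i] = (H(p + dp) - H(p)) / eps

print(np.max(np.abs(grad_fd - grad_analytic)))  # small: the two gradients agree
```

This treats $H$ as an ordinary function of four real variables, which is exactly the point of the example above.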
Actually, for most applications (e.g. if we are seeking critical points), we must take into account the constraint $\sum_i p_i = 1$, so we usually write the Lagrangian; the derivative (using natural logarithms) is then
$$ \frac{\partial}{\partial p_i}\Big(H({\bf p}) + \lambda \sum_i p_i\Big)= - \log(p_i) - 1 + \lambda$$
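As a quick illustration of why the Lagrangian matters (a sketch; the sign attached to $\lambda$ depends on the convention used), setting this derivative to zero forces $\log p_i$ to equal the same constant for every $i$:

$$-\log p_i - 1 + \lambda = 0 \;\Longrightarrow\; p_i = e^{\lambda - 1} \ \text{for all } i, \qquad \sum_i p_i = 1 \;\Longrightarrow\; p_i = \frac{1}{n},$$

i.e. the uniform pmf, recovering the well-known fact that it is the entropy maximizer over $n$ outcomes.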
The above readily generalizes to variables with infinite support (countable or not).
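As a sanity check on the limit quoted from the paper, here is a small numerical sketch (the value of $p_{ag}(g')$ is made up): the ratio $\Delta H(g')/\eta$ indeed approaches $-1-\log p_{ag}(g')$ as $\eta \to 0$.

```python
import numpy as np

p = 0.3  # hypothetical value of p_ag(g')

def delta_H(p, eta):
    # Delta H(g') from the question: the change in the -p log p term
    # when mass eta is added at g'.
    return p * np.log(p) - (p + eta) * np.log(p + eta)

target = -1.0 - np.log(p)
for eta in [1e-2, 1e-4, 1e-6]:
    print(eta, delta_H(p, eta) / eta, "target:", target)
# The ratio converges to -1 - log(p) as eta shrinks.
```

This is just the one-sided difference quotient of $-p\log p$ at $p$, which is why the limit coincides with the (unconstrained) derivative of the entropy.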