I'm trying to understand the worked propositions in this paper (machine learning / reinforcement learning): https://arxiv.org/abs/2007.02832
The authors formulate this objective:
$\hat{g}^*=\arg\max_{\hat{g}\in B}\mathbb{E}_{g'\sim q(g'|\hat{g})}[\Delta H(g')]$
Where:
$\Delta H(g')=p_{ag}(g')\log p_{ag}(g')-(p_{ag}(g')+\eta)\log(p_{ag}(g')+\eta)$
$p_{ag}$ is a PMF and $H$ is the information theory entropy.
They find that:
$\lim_{\eta\to 0}\dfrac{\Delta H(g')}{\eta}=-1-\log p_{ag}(g')$
Which they claim is the same as $\nabla_{p_{ag}}H[p_{ag}]$
$\nabla_{p_{ag}}H[p_{ag}]=-1-\log p_{ag}(g')$
I'm having a hard time understanding this last equality. I did not know it was possible to take a gradient with respect to a probability distribution. I can't make sense of it...
If you are familiar with multivariable calculus (and in particular gradients), this should be easy.
The entropy can be written as a function (a functional) of a pmf (probability mass function), say:
$$H({\bf p}) = - \sum_x p(x) \log p(x)$$
If you have trouble digesting this, imagine the random variable takes values over four integers $\{1,2,3,4\}$. Then the pmf consists of a set of four numbers: ${\bf p} = (p_1,p_2, p_3,p_4)$. And the entropy $H({\bf p}) = H(p_1,p_2, p_3,p_4)$ will take different values for different values of this (multidimensional) variable ${\bf p}$.
Hence, it makes perfect sense to take the derivative with respect to ${\bf p}$. (Notice that, even if $x$ is discrete, the values of the pmf are continuous.)
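To make this concrete, here is a quick numerical sketch (the four-point pmf below is a made-up example): the finite-difference gradient of $H$, perturbing each coordinate independently, matches the analytic formula $-\log p_i - 1$.

```python
import numpy as np

# A toy pmf over four outcomes, as in the example above (values are arbitrary).
p = np.array([0.1, 0.2, 0.3, 0.4])

def H(p):
    """Shannon entropy (natural log) of a pmf given as a vector."""
    return -np.sum(p * np.log(p))

# Analytic (unconstrained) gradient: dH/dp_i = -log(p_i) - 1
grad_analytic = -np.log(p) - 1.0

# Finite-difference gradient: perturb each coordinate independently,
# ignoring the sum-to-one constraint, as in the unconstrained derivative.
eps = 1e-7
grad_fd = np.zeros_like(p)
for i in range(len(p)):
    dp = np.zeros_like(p)
    dp[i] = eps
    grad_fd[i] = (H(p + dp) - H(p)) / eps

print(np.max(np.abs(grad_fd - grad_analytic)))  # small: the two gradients agree
```

This treats $H$ as an ordinary function of four real variables, which is exactly the point of the example above.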
Actually, for most applications (e.g. if we are seeking critical points), we must take into account the constraint $\sum_i p_i = 1$, so we usually write the Lagrangian; the derivative (using natural logarithms) is then
$$ \frac{\partial}{\partial p_i}\Big(H({\bf p}) + \lambda \sum_i p_i\Big)= - \log(p_i) - 1 + \lambda$$
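As a quick illustration of why the Lagrangian matters (a sketch; the sign attached to $\lambda$ depends on the convention used), setting this derivative to zero forces $\log p_i$ to equal the same constant for every $i$:

$$-\log p_i - 1 + \lambda = 0 \;\Longrightarrow\; p_i = e^{\lambda - 1} \ \text{for all } i, \qquad \sum_i p_i = 1 \;\Longrightarrow\; p_i = \frac{1}{n},$$

i.e. the uniform pmf, recovering the well-known fact that it is the entropy maximizer over $n$ outcomes.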
The above readily generalizes to variables with infinite support (countable or not).
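As a sanity check on the limit quoted from the paper, here is a small numerical sketch (the value of $p_{ag}(g')$ is made up): the ratio $\Delta H(g')/\eta$ indeed approaches $-1-\log p_{ag}(g')$ as $\eta \to 0$.

```python
import numpy as np

p = 0.3  # hypothetical value of p_ag(g')

def delta_H(p, eta):
    # Delta H(g') from the question: the change in the -p log p term
    # when mass eta is added at g'.
    return p * np.log(p) - (p + eta) * np.log(p + eta)

target = -1.0 - np.log(p)
for eta in [1e-2, 1e-4, 1e-6]:
    print(eta, delta_H(p, eta) / eta, "target:", target)
# The ratio converges to -1 - log(p) as eta shrinks.
```

This is just the one-sided difference quotient of $-p\log p$ at $p$, which is why the limit coincides with the (unconstrained) derivative of the entropy.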