I'm trying to find the derivative of the Shannon entropy for discrete distributions, i.e. the derivative of:
$H(P)=-\sum_{i=0}^n p_i \log(p_i)$
I didn't have much trouble finding the solution for the binary case, using $p_1 = 1-p_0$ so that $H(p_0) = -p_0 \log(p_0) - (1-p_0) \log(1 - p_0)$, which gives:
$H'(p_0) = \log(1-p_0) - \log(p_0)$
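As a quick numerical sanity check of this binary derivative (a NumPy sketch; the function name `H2` and the test value of $p_0$ are my own choices):

```python
import numpy as np

def H2(p0):
    """Binary Shannon entropy in bits, parametrised by p0."""
    return -p0 * np.log2(p0) - (1 - p0) * np.log2(1 - p0)

p0, h = 0.3, 1e-6
numeric = (H2(p0 + h) - H2(p0 - h)) / (2 * h)   # central finite difference
analytic = np.log2(1 - p0) - np.log2(p0)        # H'(p0) from above
assert abs(numeric - analytic) < 1e-6
```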
However, I'm not sure how to deal with the constraint that $\sum_{i=0}^n p_i=1 $ in the general case. Obviously, computing a partial derivative under the assumption that $p_i$ and $p_j$ are independent if $i\ne j$ leads to a meaningless result. I'd appreciate any tips on how this should be approached.
Edit: the partial derivative under the assumption that all probabilities are independent of each other (base-2 entropy) is: $\frac{\partial H}{\partial p_i} = -\frac{\ln(p_i)+1}{\ln(2)}$
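Treating every $p_i$ as a free variable, the unconstrained base-2 partial works out to $-(\ln(p_i)+1)/\ln(2)$, which can be confirmed by finite differences (a NumPy sketch; the probe vector is arbitrary):

```python
import numpy as np

def H(p):
    """Base-2 entropy with each component treated as a free variable."""
    return -np.sum(p * np.log2(p))

p = np.array([0.1, 0.2, 0.3, 0.4])
i, h = 2, 1e-6
e = np.zeros_like(p)
e[i] = h
numeric = (H(p + e) - H(p - e)) / (2 * h)       # unconstrained partial
analytic = -(np.log(p[i]) + 1) / np.log(2)
assert abs(numeric - analytic) < 1e-6
```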
TL;DR: If you want to compute a total derivative where your function input is a probability vector, you need to specify the "path" of the derivative; that is, as you increase one of the probability values, what happens to the other values in order to preserve the summation constraint?
Now, suppose you look at the total derivative with respect to $p_k$, where you assume that as you increase this value you decrease every other value proportionately to its existing value. This gives the first-order total derivatives:
$$\frac{dH}{dp_k}(\mathbf{p}) = - \frac{H(\mathbf{p}) + \log(p_k)}{1-p_k},$$
and the second-order total derivatives:
$$\frac{d^2 H}{dp_k dp_\ell}(\mathbf{p}) = \begin{cases} - 1/[p_k (1-p_k)] & & & \text{for } k = \ell, \\[12pt] 1/[(1-p_k)(1-p_\ell)] & & & \text{for } k \neq \ell. \\[12pt] \end{cases}$$
This is just one possible form for the total derivative, taken over this particular path. Nevertheless, this is a simple path derivative with a natural interpretation, and it gives a simple form for the total derivative.
Longer answer: This is a case where you have a differentiable multivariate function and you are looking for the rate of change with respect to one element, subject to a constraint on all the elements going into the function. In a case like this we can use the total derivative to measure the rate of change, but we need to specify how the inputs change so as to satisfy the constraint. For example, suppose you want to look at the rate of change with respect to the element $p_k$: you want to know how the function changes when you increase this element by a small amount. When this element changes by a small amount, you need to specify how the other elements change so that the constraint remains satisfied. In doing so, you specify a "path" over the space of input vectors that satisfy the constraint, and you get a derivative as you move along that path.
There are an infinite number of ways you could do this, corresponding to the infinite number of possible paths over the constrained space for the input vector. However, in the present case there is a fairly natural way that leads to a simple interpretation, so I will show you the total derivative for this case. For simplicity, I will look only at values of $\mathbf{p}$ that are in the interior of the simplex, so none of the elements are zero or one. To begin with, we find the gradient vector of the entropy function, which is:
$$\nabla H(\mathbf{p}) = - \begin{bmatrix} 1+\log(p_0) \\ 1+\log(p_1) \\ \vdots \\ 1+\log(p_n) \\ \end{bmatrix}.$$
In order to take the total derivative subject to the constraint, suppose we increase the element $p_k$ up to the value $p_k + dp_k$ for some infinitesimally small value $dp_k$. As we increase this element, we decrease all the other elements proportionately to their values so that the constraint holds. Thus, for some infinitesimal value $dr$ we update our input vector to:
$$\begin{aligned} p_0 &\mapsto p_0 (1 - dr) \\ p_1 &\mapsto p_1 (1 - dr) \\ & \ \ \ \vdots \\ p_{k-1} &\mapsto p_{k-1} (1 - dr) \\ p_k &\mapsto p_k + dp_k \\ p_{k+1} &\mapsto p_{k+1} (1 - dr) \\ & \ \ \ \vdots \\ p_n &\mapsto p_n (1 - dr) \\ \end{aligned}$$
Applying the constraint to the updated point, whose elements we denote by $p_i^*$, we have:
$$\begin{aligned} 1 = \sum_{i=0}^n p_i^* &= (p_k + dp_k) + (1 - dr) \sum_{i \neq k} p_i \\[6pt] &= (p_k + dp_k) + (1 - dr) (1-p_k) \\[12pt] &= 1 + dp_k - (1-p_k) dr. \\[6pt] \end{aligned}$$
Solving this constraint equation gives $dr = dp_k/(1-p_k)$ so we have $dp_i = -p_i dr = - p_i dp_k/(1-p_k)$ for all $i \neq k$. This means that we have:
$$\frac{dp_i}{dp_k} = - \frac{p_i}{1-p_k} \quad \quad \text{for all } i \neq k.$$
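As a sanity check, a short NumPy sketch confirms that the proportional update preserves the constraint and produces exactly these ratios (the value of $\mathbf{p}$ is an arbitrary interior point):

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.15, 0.25])   # arbitrary interior point of the simplex
k, dpk = 2, 1e-6

# Proportional update: raise p_k by dpk, shrink all other elements by (1 - dr)
dr = dpk / (1 - p[k])
p_new = p * (1 - dr)
p_new[k] = p[k] + dpk

assert abs(p_new.sum() - 1) < 1e-12         # constraint still holds
for i in range(len(p)):
    if i != k:
        # dp_i/dp_k should equal -p_i / (1 - p_k)
        assert abs((p_new[i] - p[i]) / dpk + p[i] / (1 - p[k])) < 1e-9
```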
Thus, the total derivative with respect to $p_k$ is:
$$\begin{aligned} \frac{dH}{dp_k}(\mathbf{p}) &= \sum_{i=0}^n \frac{\partial H}{\partial p_i}(\mathbf{p}) \cdot \frac{dp_i}{dp_k} \\[6pt] &= \frac{\partial H}{\partial p_k}(\mathbf{p}) - \sum_{i \neq k} \frac{\partial H}{\partial p_i}(\mathbf{p}) \cdot \frac{p_i}{1-p_k} \\[6pt] &= -(1 + \log(p_k)) + \frac{1}{1-p_k} \sum_{i \neq k} p_i (1 + \log(p_i)) \\[6pt] &= -(1 + \log(p_k)) - \frac{p_k (1 + \log(p_k))}{1-p_k} + \frac{1}{1-p_k} \sum_{i=0}^n p_i (1 + \log(p_i)) \\[6pt] &= -(1 + \log(p_k)) - \frac{p_k (1 + \log(p_k))}{1-p_k} + \frac{1 - H(\mathbf{p})}{1-p_k} \\[6pt] &= - \frac{1+\log(p_k)}{1-p_k} + \frac{1 - H(\mathbf{p})}{1-p_k} \\[6pt] &= - \frac{H(\mathbf{p}) + \log(p_k)}{1-p_k}. \\[6pt] \end{aligned}$$
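This closed form can be verified with a central finite difference taken along the proportional path (a NumPy sketch in natural-log units, matching the gradient above; the helper names and test point are mine):

```python
import numpy as np

def H(p):
    return -np.sum(p * np.log(p))            # entropy in nats

def move(p, k, t):
    """Step t along the proportional path in coordinate k."""
    q = p * (1 - t / (1 - p[k]))
    q[k] = p[k] + t
    return q

p = np.array([0.1, 0.2, 0.3, 0.4])
k, h = 1, 1e-6

numeric = (H(move(p, k, h)) - H(move(p, k, -h))) / (2 * h)
analytic = -(H(p) + np.log(p[k])) / (1 - p[k])
assert abs(numeric - analytic) < 1e-6
```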
The second-order total derivative with respect to $p_k$ is:
$$\begin{aligned} \frac{d^2 H}{dp_k^2}(\mathbf{p}) &= \frac{d}{dp_k} \frac{dH}{dp_k}(\mathbf{p}) \\[6pt] &= - \frac{d}{dp_k} \frac{H(\mathbf{p}) + \log(p_k)}{1-p_k} \\[6pt] &= - \frac{1}{1-p_k} \Big[ \frac{d}{dp_k} (H(\mathbf{p}) + \log(p_k)) + \frac{H(\mathbf{p}) + \log(p_k)}{1-p_k} \Big] \\[6pt] &= - \frac{1}{1-p_k} \Big[ \frac{dH}{dp_k} (\mathbf{p}) + \frac{1}{p_k} - \frac{dH}{dp_k} (\mathbf{p}) \Big] \\[6pt] &= - \frac{1}{p_k (1-p_k)}. \\[6pt] \end{aligned}$$
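Since each $p_i$ is linear in $p_k$ along the proportional path (the path is a straight line in $\mathbf{p}$-space), this second-order formula can be checked with a second central difference (a NumPy sketch; names and test point are mine):

```python
import numpy as np

def H(p):
    return -np.sum(p * np.log(p))            # entropy in nats

def move(p, k, t):
    """Step t along the proportional path in coordinate k."""
    q = p * (1 - t / (1 - p[k]))
    q[k] = p[k] + t
    return q

p = np.array([0.1, 0.2, 0.3, 0.4])
k, h = 0, 1e-4

# Second central difference along the (straight-line) path
numeric = (H(move(p, k, h)) - 2 * H(p) + H(move(p, k, -h))) / h**2
analytic = -1 / (p[k] * (1 - p[k]))
assert abs(numeric / analytic - 1) < 1e-4
```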
The second-order total derivative with respect to $p_k$ and $p_\ell$ (with $k \neq \ell$) is:
$$\begin{aligned} \frac{d^2 H}{dp_k dp_\ell}(\mathbf{p}) &= \frac{d}{dp_\ell} \frac{dH}{dp_k}(\mathbf{p}) \\[6pt] &= - \frac{d}{dp_\ell} \frac{H(\mathbf{p}) + \log(p_k)}{1-p_k} \\[6pt] &= - \frac{1}{1-p_k} \Big[ \frac{d}{dp_\ell} (H(\mathbf{p}) + \log(p_k)) + \frac{H(\mathbf{p}) + \log(p_k)}{1-p_k} \frac{d p_k}{d p_\ell} \Big] \\[6pt] &= - \frac{1}{1-p_k} \Big[ \frac{d}{dp_\ell} (H(\mathbf{p}) + \log(p_k)) - \frac{dH}{dp_k} (\mathbf{p}) \cdot \frac{d p_k}{d p_\ell} \Big] \\[6pt] &= - \frac{1}{1-p_k} \Big[ \frac{d H}{dp_\ell} (\mathbf{p}) + \frac{1}{p_k} \cdot \frac{d p_k}{d p_\ell} - \frac{dH}{dp_k} (\mathbf{p}) \cdot \frac{d p_k}{d p_\ell} \Big] \\[6pt] &= - \frac{1}{1-p_k} \Big[ \frac{d H}{dp_\ell} (\mathbf{p}) + \frac{1}{p_k} \cdot \frac{d p_k}{d p_\ell} - \frac{dH}{dp_\ell} (\mathbf{p}) \Big] \\[6pt] &= - \frac{1}{p_k (1-p_k)} \cdot \frac{d p_k}{d p_\ell} \\[6pt] &= \frac{1}{p_k (1-p_k)} \cdot \frac{p_k}{1-p_\ell} \\[6pt] &= \frac{1}{(1-p_k)(1-p_\ell)}. \\[6pt] \end{aligned}$$
We can see from the derivatives that the entropy is decreasing in $p_k$ when $\log(p_k) > -H(\mathbf{p})$ and is increasing in $p_k$ when $\log(p_k) < -H(\mathbf{p})$. This makes intuitive sense: if the element is already "larger than average" then increasing it decreases the entropy, and if it is "smaller than average" then increasing it increases the entropy.
It is important to stress that this is just one possible form for the total derivative, which hinges on our specification of the "path" for the input vector when we change the element $p_k$. In this case we have specified that an increase in $p_k$ is accompanied by a contemporaneous decrease in the other elements (and vice versa), where those other elements change proportionately to their existing values. If you were to specify a different "path" for the change in the vector corresponding to a change in a single element, this would lead to a different form for the total derivative.