While studying neural networks, I got bogged down by a nasty problem for which I could not find a satisfying answer using my mathematical knowledge.
Let's assume we have a complex multivariable function, which is directly or indirectly dependent on various variables; such that we can draw an arbitrary directed acyclic graph using the variables which affect the function. To visualize the situation, let's consider the following example diagram:

Here $F = F(X,Y,Z)$, $X = X(B,D)$, $Y = Y(C)$ and $Z = Z(X)$; in general, each variable depends directly on its parents.
Much simpler tree diagrams are common in resources on the multivariable chain rule, and I am familiar with them. But I cannot resolve the following problem:
Assume that, in such a hierarchy of variables, where $F$ is the final function being evaluated (and is nicely smooth), we want to take the total derivative of $F$ with respect to a variable in the hierarchy; let's name this variable $Q$. By the total derivative, I mean that I perturb $Q$ and see how $F$ changes, without holding the other variables fixed, in contrast to a partial derivative. So we want to evaluate $\dfrac{dF}{dQ}$. I can show by induction that this is equal to $\sum_{i}\dfrac{dF}{dP_i}\dfrac{\partial P_i}{\partial Q}$, where the sum runs over the variables $P_i$ that depend directly on $Q$, and $\dfrac{dF}{dP_i}$ is the total derivative of $F$ with respect to each such variable.
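As a sanity check of this formula, here is a minimal Python sketch on the example hierarchy. The concrete functions ($X = BD$, $Y = C^2$, $Z = \sin X$, $F = XY + Z^2$) and the names `forward`, `total_dF_dB` and `finite_difference_dF_dB` are assumptions chosen purely for illustration; only the dependency structure comes from the example above.

```python
import math

def forward(B, C, D):
    """Evaluate the hierarchy; the concrete formulas are illustrative assumptions."""
    X = B * D            # X = X(B, D)
    Y = C ** 2           # Y = Y(C)
    Z = math.sin(X)      # Z = Z(X)
    F = X * Y + Z ** 2   # F = F(X, Y, Z)
    return X, Y, Z, F

def total_dF_dB(B, C, D):
    """Total derivative dF/dB, built by summing over direct dependents."""
    X, Y, Z, _ = forward(B, C, D)
    dF_dZ = 2 * Z                      # Z feeds only F
    dF_dX = Y + dF_dZ * math.cos(X)    # X feeds F directly and also through Z
    return dF_dX * D                   # B feeds only X, and dX/dB = D

def finite_difference_dF_dB(B, C, D, h=1e-6):
    """Perturb B and let every downstream variable change with it."""
    return (forward(B + h, C, D)[3] - forward(B - h, C, D)[3]) / (2 * h)

print(total_dF_dB(0.7, 1.3, 0.5))
print(finite_difference_dF_dB(0.7, 1.3, 0.5))
```

The two printed values should agree up to finite-difference error, which is what "perturb $Q$ and let everything downstream move" means numerically.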
Let's assume we have another variable, say $W$, in this hierarchy. For regular partial derivatives, the order of differentiation does not matter. I want to know whether this also holds here, for $\dfrac{d^2F}{dWdQ}$ and $\dfrac{d^2F}{dQdW}$. If I take the total derivative twice, in different orders, will I obtain the same result? I tried to show this by induction and succeeded in some simple examples, but I cannot generalize it to an arbitrary hierarchy of variables.
I think matters become a bit more complicated when you differentiate with respect to several variables that depend on each other. The very definition of the total derivative is a bit tricky, because the statement "without holding other variables fixed" is vague. You hold fixed any variable that does not depend on $Q$; for the variables that do depend on $Q$, you change them only through their dependency on $Q$ and on other variables that depend on $Q$, holding their other dependencies constant.
In any case, I think the statement you're trying to prove is false if $W$ and $Q$ depend on each other. Consider the example of a function $F$ which depends only on a variable $W$, which in turn depends on $Q$ via $W = \sin(Q)$. Reading $\frac{d^2F}{dQdW}$ as $\frac{d}{dQ}\left(\frac{dF}{dW}\right)$, we get
$$\frac{d^2F}{dQdW} = F''(W)\cos(Q) \quad\text{ whereas }\quad \frac{d^2F}{dWdQ} = F''(W) \cos(Q) + F'(W) \frac{d\cos(Q)}{dW},$$
and $\frac{d\cos(Q)}{dW} = -\frac{\sin(Q)}{\cos(Q)} = -\tan(Q)$.
EDIT: Realizing that the previous example might have been more abstract than needed, here's a very straightforward one: Suppose we have two variables $W$ and $Q$, with the relation $W = e^Q$, and we want to study the function $F(W) = W$.
Clearly, $\frac{dF}{dW} = 1$, and thus $\frac{d^2F}{dQdW} = 0$. On the other hand, $\frac{dF}{dQ} = \frac{d(e^Q)}{dQ} = e^Q = W$ and thus $\frac{d^2F}{dWdQ} = 1$.
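Both examples can be checked numerically. The following is only a rough sketch: the helper names `ddx` and `mixed_derivatives` are made up for illustration, the choice $F(W) = W^2$ for the first example is an assumption (the argument above leaves $F$ general), and the derivative with respect to $W$ is computed as the derivative with respect to $Q$ divided by $\frac{dW}{dQ}$.

```python
import math

def ddx(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def mixed_derivatives(dF_dW, W_of_Q, dW_dQ, Q):
    """Return (d/dQ of dF/dW, d/dW of dF/dQ) when F depends on Q only through W."""
    # dF/dW, viewed as a function of Q
    dF_dW_of_Q = lambda q: dF_dW(W_of_Q(q))
    # dF/dQ = F'(W) * dW/dQ, also viewed as a function of Q
    dF_dQ = lambda q: dF_dW(W_of_Q(q)) * dW_dQ(q)
    q_after_w = ddx(dF_dW_of_Q, Q)        # differentiate dF/dW with respect to Q
    w_after_q = ddx(dF_dQ, Q) / dW_dQ(Q)  # differentiate dF/dQ with respect to W
    return q_after_w, w_after_q

# First example: W = sin(Q), with the assumed choice F(W) = W**2, so F'(W) = 2W.
print(mixed_derivatives(lambda w: 2 * w, math.sin, math.cos, 0.5))

# Second example: W = exp(Q), F(W) = W, so F'(W) = 1.
print(mixed_derivatives(lambda w: 1.0, math.exp, math.exp, 0.3))  # approximately (0.0, 1.0)
```

In both cases the two entries of the returned pair differ, matching the hand computations.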