In his paper "Algebraic Solutions of Differential Equations (p-Curvature and the Hodge Filtration)", Katz mentions (on page $3$) a mysterious identity about derivations of commutative algebras in characteristic $p$:

Namely, if $A$ is a commutative algebra over $\Bbb F_p$ and $D$ is a derivation of $A$ (i.e. a linear map satisfying the Leibniz rule), then $D^{p-1}(X^{p-1} DX)+(D(X))^p=X^{p-1}D^p(X)$ for any $X \in A$. I checked it for polynomial rings, but why does it hold in such generality? Hochschild's original paper does not state this as a lemma but only regards it as a technical identity. How can one prove it in modern language?
The case of an arbitrary commutative algebra $A$ follows automatically from the case of a polynomial algebra. It is likely that a proof for polynomial algebras is also a proof for other algebras, but I'll still explain below why one follows from the other.
Rewrite the formula
First of all, let's rewrite the formula in more compact form:
$$(DX)^p = [X^{p-1},D^{p-1}]DX$$
This notation is a little abusive: $[X^{p-1}, D^{p-1}]$ means the operator commutator $\mu_X^{p-1}\circ D^{p-1} - D^{p-1}\circ \mu_X^{p-1}$, where $\mu_X(Y) = X\cdot Y$ is multiplication by $X$.
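To see that this really is the identity from the question, one can unwind the commutator, using that $\mu_X^{p-1}$ is multiplication by $X^{p-1}$:
\begin{align*}
[X^{p-1},D^{p-1}]DX &= X^{p-1}D^{p-1}(DX) - D^{p-1}(X^{p-1}DX)\\
&= X^{p-1}D^{p}X - D^{p-1}(X^{p-1}DX),
\end{align*}
so $(DX)^p = [X^{p-1},D^{p-1}]DX$ is just a rearrangement of $D^{p-1}(X^{p-1}DX) + (DX)^p = X^{p-1}D^pX$.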
Reduction of "general case" to polynomial algebra case.
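Choose a surjective $k$-algebra homomorphism $\varphi: k[X] \to A$ from a polynomial algebra to $A$ (here $k = \Bbb F_p$, and $k[X]$ may have many variables, not to be confused with the element $X \in A$ in the identity), and a $k$-linear section $\bar{\varphi}^{-1}: A \to k[X]$, i.e. a linear map satisfying $\varphi \circ \bar \varphi^{-1} = \text{id}_A$.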
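Now suppose you know the identity for all derivations in $\text{Der}_k(k[X],k[X])$. Given $D \in \text{Der}_k(A,A)$, you can choose a (not unique) lift $\tilde D \in \text{Der}_k(k[X],k[X])$ such that $\varphi \circ \tilde D = D \circ \varphi$, and hence $D = \varphi \circ \tilde D \circ \bar\varphi^{-1}$. (Finding such a $\tilde D$ is not hard: a derivation of a polynomial algebra can take arbitrary values on the variables, so set $\tilde D(X_i) := \bar\varphi^{-1}\big(D(\varphi(X_i))\big)$ on each variable $X_i$ and extend by the Leibniz rule. Then $\varphi\circ\tilde D$ and $D\circ\varphi$ agree on the variables and both satisfy the Leibniz rule with respect to $\varphi$, so they agree everywhere. Note that the composite $\bar\varphi^{-1} \circ D \circ \varphi$ itself is only linear and need not be a derivation.)
Now, by induction, $\varphi\circ\tilde D^n = D^n\circ\varphi$, so we have identities like $D^n = \varphi \circ \tilde D^n \circ \bar\varphi^{-1}$ and can basically transport the equation from $A$ to $k[X]$, where we know it is true, and back. You will also need to use that $\varphi$ is a ring homomorphism and $\varphi\circ \bar\varphi^{-1} = \text{id}_A$, of course. Anyway, you probably see where this is going, but I'll write it out just in case: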
\begin{align*}
(DX)^p - [X^{p-1},D^{p-1}]DX &= \varphi \big(\tilde D (\bar\varphi^{-1}X)\big)^p - \varphi\Big([(\bar\varphi^{-1}X)^{p-1}, \tilde D^{p-1}]\tilde D(\bar\varphi^{-1}X)\Big)\\
&= \varphi\bigg(\big(\tilde D(\bar\varphi^{-1}X)\big)^p - [(\bar\varphi^{-1}X)^{p-1},\tilde D^{p-1}]\tilde D(\bar\varphi^{-1}X)\bigg)\\
&= \varphi(0) = 0
\end{align*}
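Since the reduction rests on the polynomial case, it may be reassuring to test that case by machine. The following is only a sanity check, not a proof, and a minimal sketch under stated assumptions: it works in the one-variable ring $\Bbb F_p[t]$, where every derivation has the form $g\cdot \frac{d}{dt}$, and all function names (`derivation`, `identity_holds`, and so on) are made up for this illustration. Polynomials are stored as coefficient lists, lowest degree first.

```python
def trim(f):
    """Drop trailing zero coefficients so equal polynomials compare equal."""
    f = list(f)
    while f and f[-1] == 0:
        f.pop()
    return f


def add(f, g, p):
    """Sum of two polynomials over F_p."""
    n = max(len(f), len(g))
    f = f + [0] * (n - len(f))
    g = g + [0] * (n - len(g))
    return trim([(a + b) % p for a, b in zip(f, g)])


def mul(f, g, p):
    """Product of two polynomials over F_p."""
    if not f or not g:
        return []
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] = (out[i + j] + a * b) % p
    return trim(out)


def power(f, n, p):
    """n-th power of f over F_p, by repeated multiplication."""
    out = [1]
    for _ in range(n):
        out = mul(out, f, p)
    return out


def ddt(f, p):
    """Formal derivative d/dt over F_p."""
    return trim([(i * c) % p for i, c in enumerate(f)][1:])


def derivation(g, p):
    """The derivation D = g * d/dt of F_p[t] (hypothetical helper)."""
    return lambda f: mul(g, ddt(f, p), p)


def iterate(D, f, n):
    """Apply D to f exactly n times."""
    for _ in range(n):
        f = D(f)
    return f


def identity_holds(x, g, p):
    """Check D^{p-1}(x^{p-1} Dx) + (Dx)^p == x^{p-1} D^p x for D = g * d/dt."""
    x = trim([c % p for c in x])
    g = trim([c % p for c in g])
    D = derivation(g, p)
    dx = D(x)
    lhs = add(iterate(D, mul(power(x, p - 1, p), dx, p), p - 1),
              power(dx, p, p), p)
    rhs = mul(power(x, p - 1, p), iterate(D, x, p), p)
    return lhs == rhs


if __name__ == "__main__":
    # x = 1 + t^2 and D = (1 + t + t^2) d/dt, for a few small primes
    for p in (2, 3, 5, 7):
        print(p, identity_holds([1, 0, 1], [1, 1, 1], p))
```

The choice of test polynomials is arbitrary; any other $x$ and $g$ can be substituted, and a failure for some input would point to a bug in the sketch rather than in the identity.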
Case of $p=2$.
Second of all, let's observe the formula in the easiest case $p=2$. Namely, \begin{align} [D,X](Y) &= D(XY) - XDY\\ &= XDY + YDX - XDY\\ &= YDX, \end{align} so $[D,X]$ is the operator of multiplication by $DX$. Since $[X,D] = -[D,X] = [D,X]$ in characteristic $2$, we get $[X,D](DX) = DX\cdot DX = (DX)^2$ as desired.
Case of $p=3$.
Now try a less trivial, still easily computable case.
\begin{align} [D^2,X^2](Y) &= D^2(X^2Y) - X^2D^2Y\\ &= D(2XYDX + X^2DY) - X^2D^2Y\\ &= 2D(XYDX) + 2XDX\cdot DY\\ &= 4XDXDY + 2Y(DX)^2 +2XYD^2X \end{align}
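This might not look like much, but when $Y = DX$ the first and last terms combine to $6X\,DX\,D^2X$, which vanishes because $6 \equiv 0 \pmod 3$, and you are left with $2(DX)^3 = - (DX)^3$. Since $[X^2,D^2] = -[D^2,X^2]$, this gives $[X^2,D^2]DX = (DX)^3$ as desired.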
Now there is a general approach: expand $[X^{p-1},D^{p-1}]Y$, and match terms that become the same when setting $Y = DX$. Compute their coefficients mod $p$ and make sure they all cancel, except the middle term. Finally make sure the coefficient on the middle term is 1.
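One possible way to organize that expansion (a sketch, and not necessarily the route taken in Hochschild's or Katz's papers) is the general Leibniz rule for a derivation, together with the congruence $\binom{p-1}{k} \equiv (-1)^k \pmod p$:
$$D^{p-1}(X^{p-1}Y) = \sum_{k=0}^{p-1}\binom{p-1}{k}\,D^k(X^{p-1})\,D^{p-1-k}(Y) = \sum_{k=0}^{p-1}(-1)^k\,D^k(X^{p-1})\,D^{p-1-k}(Y).$$
The $k=0$ term is $X^{p-1}D^{p-1}Y$, which is exactly what the commutator subtracts off, so
$$[X^{p-1},D^{p-1}]Y = -\sum_{k=1}^{p-1}(-1)^k\,D^k(X^{p-1})\,D^{p-1-k}(Y).$$
For $p=3$ this reads $2X\,DX\,DY - \big(2(DX)^2 + 2X D^2X\big)Y$, which agrees mod $3$ with the negative of the expression computed above for $[D^2,X^2](Y)$.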