It's commonly said that in VAEs we use the reparameterization trick because "we can't backpropagate through a stochastic node".

It makes sense from the picture, but I found it hard to understand exactly what it means and why. Let's say $X \sim N(u, 1)$,
and we want to compute $$\frac{d X}{d u}$$ which is not possible because the sampling operation is non-differentiable. That is, we don't know how changing $u$ a little bit would affect the sample $X$ we drew.
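(For context, the reparameterization trick addresses exactly this point. A minimal numpy sketch, assuming the unit-variance setting above: rewrite the sample as $X = u + \epsilon$ with $\epsilon \sim N(0,1)$, so the randomness lives in $\epsilon$ and $dX/du = 1$ exactly.)

```python
import numpy as np

# Reparameterization sketch (assumes X ~ N(u, 1) as above).
# Instead of sampling X ~ N(u, 1) directly, write X = u + eps, eps ~ N(0, 1):
# the noise eps does not depend on u, so dX/du = 1 and gradients flow through.

rng = np.random.default_rng(0)
eps = rng.standard_normal()          # noise, drawn once, independent of u

def sample(u):
    return u + eps                   # X = u + eps  =>  dX/du = 1

def loss(x):
    return x ** 2                    # an arbitrary downstream loss f(X)

u = 1.5
# Chain rule through the reparameterized sample: df/du = f'(X) * dX/du = 2*X * 1
grad_analytic = 2.0 * sample(u)

# Finite-difference check, holding the SAME eps fixed (common random numbers):
h = 1e-6
grad_numeric = (loss(sample(u + h)) - loss(sample(u - h))) / (2 * h)

assert abs(grad_analytic - grad_numeric) < 1e-4
```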
However, consider MLE for a Gaussian, where we are trying to maximize the following quantity:
$$\sum_{i=1}^N \log p(X_i;u)$$ for which the derivative $$\frac{d \log p(X_i ; u)}{d u}$$ can be easily computed. My confusion comes from the fact that $$\frac{d \log p(X_i ; u)}{d u} = \frac{d \log p(X_i ; u)}{d X_i} \frac{d X_i}{d u}$$ by the chain rule. If we can't compute $\frac{d X_i}{d u}$, why can we compute $\frac{d \log p(X_i ; u)}{d u}$?
I think your application of the chain rule is not correct. In particular, if I understand your setting correctly, the $X_i$ are your data, which do not depend on $u$. Their log-likelihood does depend on $u$: for a constant variance it is essentially the negative MSE, $-\frac{1}{2}\sum_i (X_i-u)^2$, plus terms that don't depend on $u$. Taking the derivative then gives $\sum_i \frac{d\log p(X_i;u)}{du} = \sum_i (X_i-u)$ (up to constant factors coming from the variance).
Notice the final expression depends on $X_i$, but not because we differentiated *through* it: $dX_i/du$ is zero, since $X_i$ is data and does not depend on $u$.
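To make this concrete, here is a small numpy sketch of the answer's point, assuming unit variance so $\log p(x;u) = -\tfrac{1}{2}(x-u)^2 - \tfrac{1}{2}\log 2\pi$: the data are fixed numbers, the gradient with respect to $u$ is $\sum_i (X_i - u)$, and no derivative through $X_i$ ever appears.

```python
import numpy as np

# MLE gradient sketch (assumes X_i ~ N(u, 1), so log p(x; u) = -0.5*(x-u)^2 + const).
# The data X_i are fixed constants: dX_i/du = 0, so the chain-rule term
# through X_i vanishes and we differentiate log p directly with respect to u.

X = np.array([0.3, -1.2, 2.5, 0.8])   # observed data, held fixed
u = 0.5

def log_lik(u, X):
    return np.sum(-0.5 * (X - u) ** 2 - 0.5 * np.log(2 * np.pi))

# Analytic gradient: d/du sum_i log p(X_i; u) = sum_i (X_i - u)
grad_analytic = np.sum(X - u)

# Finite-difference check: perturb u only; X never changes.
h = 1e-6
grad_numeric = (log_lik(u + h, X) - log_lik(u - h, X)) / (2 * h)

assert abs(grad_analytic - grad_numeric) < 1e-5
```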