I am a beginner at statistics, and there are several papers that use mutual information as a measure of variable independence. I know that $I(X,Y) = 0$ is equivalent to independence between $X$ and $Y$, but why can the mutual information further quantify the degree of this (in)dependence?
Besides, which other distance or divergence measures can serve as a measure of variable independence? For example, the mutual information is actually the Kullback–Leibler divergence between $P(X,Y)$ and $P(X) \otimes P(Y)$ (from the above wiki link), so can the total variation distance \begin{equation} \delta(P(X,Y), P(X) \otimes P(Y)) = \max_{x, y} |P(x,y) - P(x) P(y)| \end{equation} also be a measure of independence?
Saying something "can be treated" as a measure of something else is a pretty vague statement. Usually "treating $f$ as a measure of $g$" means something like: $f(x) > 0$ if and only if $g(x) = 1$. As you have noted, this holds for the mutual information if we take $f(X,Y) = I(X,Y)$ and let $g(X,Y)$ be $1$ or $0$ according to whether the variables are dependent. Another justification for using $I$ in this way is that, for discrete variables, $I(X,Y)$ is bounded above by $\min(H(X), H(Y))$, and it attains this bound when one variable is a deterministic function of the other; in particular, when they are related by an invertible function $Y = h(X)$ and $X = h^{-1}(Y)$. Intuitively this corresponds to a situation where the variables are maximally dependent.
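To make this concrete, here is a minimal numerical sketch (the function names are my own) comparing the two quantities on discrete joint distributions: both the mutual information and the max-pointwise deviation from the product of marginals are zero exactly when the joint factorizes, and both grow as the variables become more dependent.

```python
import numpy as np

def mutual_information(joint):
    """I(X,Y) = KL(P(X,Y) || P(X) x P(Y)) in nats, for a discrete joint pmf
    given as a 2-D array with joint[i, j] = P(X=i, Y=j)."""
    px = joint.sum(axis=1, keepdims=True)   # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)   # marginal P(Y)
    prod = px * py                          # product distribution P(X) x P(Y)
    mask = joint > 0                        # 0 * log(0) = 0 by convention
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))

def max_deviation(joint):
    """max_{x,y} |P(x,y) - P(x)P(y)|, the quantity proposed in the question."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return float(np.abs(joint - px * py).max())

# Independent case: the joint is exactly the product of its marginals.
indep = np.outer([0.3, 0.7], [0.4, 0.6])

# Maximally dependent case: Y = X with probability 1 (invertible mapping).
dep = np.array([[0.5, 0.0],
                [0.0, 0.5]])

print(mutual_information(indep), max_deviation(indep))  # both 0
print(mutual_information(dep), max_deviation(dep))      # log(2) ~ 0.693, 0.25
```

For the dependent case, $I(X,Y) = \log 2 = H(X) = H(Y)$, which is exactly the upper bound attained by an invertible deterministic relationship.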