I'm trying to understand how the authors of the paper 'Axiomatic Attribution for Deep Networks' determined the formula for Integrated Gradients.
The path function $\gamma = (\gamma_1, ..., \gamma_n): [0, 1] \rightarrow \mathbb{R}^n$ specifies a smooth path from the baseline $x' \in \mathbb{R}^n$ to the input $x \in \mathbb{R}^n$, where $\gamma(0)=x'$ and $\gamma(1)=x$. The function $F : \mathbb{R}^n \rightarrow [0, 1]$ is represented by a deep neural network in this case. The authors integrate the gradients over the path $\gamma$ with $\alpha \in [0,1]$ using the following line integral:
$$\text{PathIntegratedGrads}_i^\gamma(x)::=\int_{\alpha=0}^1 \frac{\partial F(\gamma(\alpha))}{\partial \gamma_i(\alpha)} \frac{\partial \gamma_i(\alpha)}{\partial \alpha} d\alpha. \tag{1}$$
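(To convince myself that equation (1) behaves as expected, I checked it numerically with a small sketch. The toy $F$, its analytic gradient, the curved path, and the midpoint Riemann rule below are all my own choices, not from the paper; the check is that the per-coordinate attributions along any smooth path from $x'$ to $x$ sum to $F(x) - F(x')$.)

```python
import numpy as np

# Toy differentiable function and its analytic gradient (my own choice).
def F(x):
    return x[0] ** 2 + 3.0 * x[1]

def grad_F(x):
    return np.array([2.0 * x[0], 3.0])

x_base = np.array([0.0, 0.0])        # baseline x'
x_inp = np.array([1.0, 2.0])         # input x

def gamma(a):                        # a curved path with gamma(0) = x', gamma(1) = x
    return x_base + a ** 2 * (x_inp - x_base)

def dgamma(a):                       # d gamma / d alpha
    return 2.0 * a * (x_inp - x_base)

steps = 2000
alphas = (np.arange(steps) + 0.5) / steps            # midpoint rule over [0, 1]
integrand = np.array([grad_F(gamma(a)) * dgamma(a) for a in alphas])
path_ig = integrand.mean(axis=0)                     # PathIntegratedGrads per coordinate

print(path_ig.sum(), F(x_inp) - F(x_base))           # both ≈ 7.0
```

The gradient is evaluated *at* the path point $\gamma(\alpha)$ and multiplied by $\partial \gamma_i / \partial \alpha$, exactly as in the line integral.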
Now, the authors consider the straight-line path:
$$\gamma(\alpha) = x' + \alpha \cdot (x - x') \text{ for } \alpha \in [0, 1]. \tag{2}$$
Plugging the straight-line path into the above formula for $\text{PathIntegratedGrads}_i^\gamma(x)$, they get:
$$\text{IntegratedGrads}_i(x)::= (x_i - x_i')\int_{\alpha=0}^1 \frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial x_i} d\alpha. \tag{3}$$
Since
$$\frac{\partial \gamma_i(\alpha)}{\partial \alpha} = (x_i - x_i'), \tag{4}$$
it follows that: $$\frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial (x_i' + \alpha \cdot (x_i - x_i'))} \overset{!}{=} \frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial x_i}. \tag{5}$$
However, applying the chain rule, I got:
$$\frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial (x_i' + \alpha \cdot (x_i - x_i'))} = \frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial x_i} \frac{1}{\alpha}. \tag{6}$$
Then shouldn't integrated gradients instead be:
$$\text{IntegratedGrads}_i(x)::= (x_i - x_i')\int_{\alpha=0}^1 \frac{\partial F(x' + \alpha \cdot (x - x'))}{ \partial (x_i' + \alpha \cdot (x_i - x_i'))} d\alpha. \tag{7}$$
This is also the way it is implemented in the authors' GitHub repository:
# Scale input and compute gradients.
scaled_inputs = [baseline + (float(i)/steps)*(inp-baseline) for i in range(0, steps+1)]
predictions, grads = predictions_and_gradients(scaled_inputs, target_label_index) # shapes: <steps+1>, <steps+1, inp.shape>
avg_grads = np.average(grads[:-1], axis=0)
integrated_gradients = (inp-baseline)*avg_grads # shape: <inp.shape>
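(To probe the discrepancy numerically, I compared the two readings on a 1-D toy example, $F(u) = u^2$, which is my own choice and not the authors' code. The paper's integrand merely *evaluates* the gradient at the scaled input; the alternative reading multiplies in the extra $1/\alpha$ factor from my chain-rule computation.)

```python
import numpy as np

# 1-D toy example (my own): F(u) = u**2, so dF/du = 2u and
# F(x) - F(x') = inp**2 - baseline**2 gives an exact target to compare against.
inp, baseline = 3.0, 1.0
steps = 1000
alphas = (np.arange(steps) + 0.5) / steps            # midpoint rule, avoids alpha = 0
scaled = baseline + alphas * (inp - baseline)        # points on the straight-line path
grads = 2.0 * scaled                                 # dF/du evaluated AT each scaled point

paper_ig = (inp - baseline) * grads.mean()           # equation (3): no alpha factor
alt_ig = (inp - baseline) * (grads / alphas).mean()  # reading with the extra 1/alpha

print(paper_ig, inp ** 2 - baseline ** 2)            # ≈ 8.0 vs exactly 8.0
print(alt_ig)                                        # far from 8, and grows with `steps`
```

Only the paper's version recovers $F(x) - F(x') = 8$; the $1/\alpha$ version does not even converge as `steps` grows, since the integrand blows up near $\alpha = 0$.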
Did I go wrong somewhere or am I missing something?
EDIT: I'm still having some trouble following. Let's say we plug the straight-line path into the path integral; then we get Integrated Gradients:
$$ \text{IntegratedGrads}_i(x)::=\int_{\alpha=0}^1 \frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial (x'_i + \alpha \cdot (x_i - x'_i))} \underbrace{\frac{\partial (x'_i + \alpha \cdot (x_i - x'_i))}{\partial \alpha}}_{(x_i - x'_i)} d\alpha.\tag{8} $$
Then I have a $\partial (x'_i + \alpha \cdot (x_i - x'_i))$ in the denominator of the integrand, whereas the paper only has a $\partial x_i$. Therefore, equation (8) is different from equation (3), which is the Integrated Gradients formula from the paper. When I rewrite the integrand of Integrated Gradients from the paper with the chain rule I get the following:
$$ \frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial x_i} = \frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial (x'_i + \alpha \cdot (x_i - x'_i))} \underbrace{\frac{\partial (x'_i + \alpha \cdot (x_i - x'_i))}{\partial x_i}}_\alpha, \tag{9} $$ which explains the additional $\alpha$ in (6) and that the LHS and RHS are not equal in equation (5). I also changed the equality sign in equation (5) to emphasize that the LHS and RHS are not the same. What am I missing, such that I get the formula of Integrated Gradients as stated in the paper?
Regarding the interpretation of the math in the paper: I think you have a $\partial (x'_i + \alpha \cdot (x_i - x'_i))$ term instead of a $\partial x_i$ term in the denominator of the LHS of equation (5).
(I also wonder if there is some confusion about notation: the derivative is taken with respect to a variable [$x_i$] but evaluated at a particular point [$x'_i + \alpha \cdot (x_i - x'_i)$].) Regarding the implementation in the GitHub repository, I am not seeing the extra $\alpha$ term. [If it helps, the derivative returned by the ML library (e.g. TensorFlow) corresponds to $\partial F(x' + \alpha \cdot (x - x'))/\partial x_i$ and not $\partial F(x' + \alpha \cdot (x - x'))/\partial (x'_i + \alpha \cdot (x_i - x'_i))$.]
(If you want to do the complete derivation, first use the fundamental theorem of calculus, then the partial derivative chain rule as in case 1 here: http://tutorial.math.lamar.edu/Classes/CalcIII/ChainRule.aspx and the result should match up.)
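Spelling that derivation out (this is just the fundamental theorem of calculus followed by the multivariate chain rule):

$$ F(x) - F(x') = F(\gamma(1)) - F(\gamma(0)) = \int_{\alpha=0}^1 \frac{d F(\gamma(\alpha))}{d\alpha}\, d\alpha = \int_{\alpha=0}^1 \sum_{i=1}^n \left.\frac{\partial F}{\partial x_i}\right|_{\gamma(\alpha)} \frac{\partial \gamma_i(\alpha)}{\partial \alpha}\, d\alpha. $$

For the straight-line path, $\partial \gamma_i(\alpha)/\partial \alpha = (x_i - x'_i)$ is constant in $\alpha$ and can be pulled out of the integral, so the $i$-th summand is

$$ (x_i - x'_i) \int_{\alpha=0}^1 \left.\frac{\partial F}{\partial x_i}\right|_{x' + \alpha \cdot (x - x')} d\alpha, $$

which is exactly equation (3). The key point is that $\frac{\partial F(x' + \alpha \cdot (x - x'))}{\partial x_i}$ in the paper denotes the $i$-th partial derivative of $F$ *evaluated at* the path point, not the derivative of the composite map $x_i \mapsto F(x' + \alpha \cdot (x - x'))$; with that reading, no extra $\alpha$ appears.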