Why do the magnitudes of the gradients produced by the soft targets scale as $1/T^2$ in knowledge distillation?


In the paper "Distilling the Knowledge in a Neural Network", they claim that "since the magnitudes of the gradients produced by the soft targets scale as $1/T^2$, it is important to multiply them by $T^2$ when using both hard and soft targets".

In section 2.1, they write:

Each case in the transfer set contributes a cross-entropy gradient, $dC/dz_i$, with respect to each logit, $z_i$, of the distilled model. If the cumbersome model has logits $v_i$ which produce soft target probabilities $p_i$ and the transfer training is done at a temperature of $T$, this gradient is given by:

$$ \frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right) \tag{2} $$

If the (softmax) temperature is high compared with the magnitude of the logits, we can approximate:

$$ \frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right) \tag{3} $$

If we now assume that the logits have been zero-meaned separately for each transfer case so that $\sum_j z_j = \sum_j v_j = 0$, Eq. 3 simplifies to:

$$ \frac{\partial C}{\partial z_i} \approx \frac{1}{NT^2}(z_i - v_i) \tag{4} $$
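As a quick numerical sanity check (my own code, not from the paper), the exact soft-target gradient of Eq. 2 can be compared against the $1/(NT^2)$ approximation of Eq. 4, and doubling $T$ should divide the gradient magnitude by roughly 4:

```python
import numpy as np

def soft_target_grad(z, v, T):
    """Exact gradient of Eq. 2: (1/T) * (softmax(z/T) - softmax(v/T))."""
    q = np.exp(z / T) / np.sum(np.exp(z / T))
    p = np.exp(v / T) / np.sum(np.exp(v / T))
    return (q - p) / T

rng = np.random.default_rng(0)
N = 5
# Zero-mean the logits, as assumed in the derivation of Eq. 4.
z = rng.normal(size=N); z -= z.mean()
v = rng.normal(size=N); v -= v.mean()

for T in [10, 20, 40]:
    exact = soft_target_grad(z, v, T)         # Eq. 2
    approx = (z - v) / (N * T**2)             # Eq. 4
    print(T, np.abs(exact).max(), np.abs(approx).max())
# When T doubles, the gradient magnitude shrinks by roughly 4x: 1/T^2 scaling.
```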

For hard targets (used at temperature $T = 1$), the cross-entropy gradient $dC/dz_i$ with respect to the true-class logit $z_i$ of the student model is

$$ \frac{\partial C}{\partial z_i} = (q_i - 1) = \left(\frac{e^{z_i}}{\sum_j e^{z_j}} - 1\right) \tag{5} $$
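For comparison, here is a minimal sketch (my own code, using a hypothetical `hard_target_grad` helper) of the hard-target gradient of Eq. 5, computed at $T = 1$, so no temperature factor appears and its magnitude stays $O(1)$:

```python
import numpy as np

def hard_target_grad(z, target):
    """Gradient of cross-entropy w.r.t. the logits for a one-hot target: q - y."""
    q = np.exp(z) / np.sum(np.exp(z))
    y = np.zeros_like(z)
    y[target] = 1.0
    return q - y

z = np.array([2.0, 0.5, -1.0])
# The true-class component is q_0 - 1 (Eq. 5); the others are just q_i.
print(hard_target_grad(z, target=0))
```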

I can see that the $1/T^2$ factor appears in Eq. 4, but why do they claim that the magnitudes of the gradients produced by the soft targets scale as $1/T^2$? I can't see why Eq. 4 is $1/T^2$ times Eq. 5.