Knowledge Distillation math proof


In the paper "Distilling the Knowledge in a Neural Network" by Hinton et al., the student model is trained on soft targets by minimizing the cross-entropy $C$ between the teacher's and the student's output distributions. Assume $i$ is an integer with $i \in [1, N]$, where $N$ is the number of classes the models are trained to classify. Section 2.1 of the paper reads as follows:

Each case in the transfer set contributes a cross-entropy gradient, $dC/dz_i$, with respect to each logit, $z_i$ of the distilled model. If the cumbersome model has logits $v_i$ which produce soft target probabilities $p_i$ and the transfer training is done at a temperature of $T$, this gradient is given by:

$$ \frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right) \tag{2} $$

If the (softmax) temperature is high compared with the magnitude of the logits, we can approximate:

$$ \frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right) \tag{3} $$

If we now assume that the logits have been zero-meaned separately for each transfer case, so that $\sum_j z_j = \sum_j v_j = 0$, Eq. 3 simplifies to:

$$ \frac{\partial C}{\partial z_i} \approx \frac{1}{NT^2} (z_i - v_i) \tag{4} $$

So in the high temperature limit, distillation is equivalent to minimizing $$ \frac{1}{2}(z_i - v_i)^2, \tag{5} $$ provided the logits are zero-meaned separately for each transfer case.

I believe this is a good paper, but it skipped so many steps that it is hard for a beginner like me to understand.

I already managed to derive Eq. 2 from the cross-entropy, and my problems are Eq. 3 and Eq. 5. For Eq. 3, I tried to use $\lim_{T\to\infty}e^{z_i/T} = \lim_{T\to\infty}(1+z_i/T)=1$, but I'm not sure whether that is correct. For Eq. 5, I just don't know how to get the equation.

Best answer:

For equation $(3)$, they use the first-order Taylor approximation

$$e^{x}\approx 1+x,$$

valid when $x$ is small. Here $x$ stands for $\frac{z_i}{T}$ and likewise $\frac{v_i}{T}$, so when $T$ is large compared with the magnitude of the logits, the approximation is good. Substituting it into both softmaxes in Eq. $(2)$ gives Eq. $(3)$ directly. Note that you keep the $z_i/T$ term rather than taking the full limit: pushing all the way to $\lim_{T\to\infty} e^{z_i/T} = 1$ would discard exactly the dependence on the logits that the gradient needs.
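To make the step from Eq. $(3)$ to Eq. $(4)$ explicit: with the zero-mean assumption $\sum_j z_j = \sum_j v_j = 0$, both denominators in Eq. $(3)$ reduce to $N$, and the constant $1$'s cancel:

$$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N} - \frac{1 + v_i/T}{N}\right) = \frac{1}{T}\cdot\frac{(z_i - v_i)/T}{N} = \frac{1}{NT^2}(z_i - v_i).$$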

For equation $(5)$, differentiate the quadratic with respect to $z_i$:

$$\frac{\partial}{\partial z_i}\left[\frac{1}{2}(z_i - v_i)^2\right] = z_i - v_i,$$

which matches the gradient in Eq. $(4)$ up to the constant factor $\frac{1}{NT^2}$. So in the high-temperature limit, gradient descent on the distillation cross-entropy pushes the logits in the same direction as minimizing $\frac{1}{2}(z_i - v_i)^2$. Equivalently, the quadratic attains its minimum when $z_i = v_i$, which is the same condition as setting equation $(4)$ to $0$.
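As a numerical sanity check (my own sketch, not from the paper), you can verify that for a large temperature and zero-meaned logits, the exact gradient $\frac{1}{T}(q_i - p_i)$ from Eq. $(2)$ is very close to the approximation $\frac{1}{NT^2}(z_i - v_i)$ from Eq. $(4)$:

```python
import numpy as np

def softmax(x, T):
    """Softmax of logits x at temperature T."""
    e = np.exp(x / T)
    return e / e.sum()

rng = np.random.default_rng(0)
N = 10                                   # number of classes

z = rng.normal(size=N); z -= z.mean()    # student logits, zero-meaned
v = rng.normal(size=N); v -= v.mean()    # teacher logits, zero-meaned

T = 100.0                                # temperature >> |logits|

exact = (softmax(z, T) - softmax(v, T)) / T   # Eq. (2): (1/T)(q_i - p_i)
approx = (z - v) / (N * T**2)                 # Eq. (4): (1/(N T^2))(z_i - v_i)

# The discrepancy is higher-order in 1/T, so it shrinks as T grows.
print(np.max(np.abs(exact - approx)))
```

Lowering `T` toward the scale of the logits makes the gap grow, which illustrates why the derivation is a high-temperature limit.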