This might be a bit basic, but can someone tell me how the authors went from the first equation to the next?
Each case in the transfer set contributes a cross-entropy gradient, $dC/dz_i$, with respect to each logit $z_i$ of the distilled model. If the cumbersome model has logits $v_i$ which produce soft target probabilities $p_i$ and the transfer training is done at a temperature of $T$, this gradient is given by $$\frac{\partial C}{\partial z_i} = \frac{1}{T}(q_i - p_i) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_je^{z_j/T}} - \frac{e^{v_i/T}}{\sum_je^{v_j/T}}\right)$$
If the temperature is high compared with the magnitude of the logits, we can approximate: $$\frac{\partial C}{\partial z_i} \approx \frac{1}{T} \left(\frac{1+z_i/T}{N+\sum_jz_j/T} - \frac{1+v_i/T}{N+\sum_jv_j/T}\right)$$
This is simply the first-order Taylor expansion of the exponential function, $$e^x \approx 1+x,$$ or more precisely $$e^x = 1+x+O(x^2).$$ It holds best for $x$ close to $0$; here $x = z_i/T$ or $x = v_i/T$, which is small when $T$ is large relative to the logits, as in your question. For the denominators, where $N$ is the number of logits, $$ \sum_j e^{x_j} \approx \sum_j(1+x_j) = \sum_j 1+\sum_j x_j = N+\sum_j x_j. $$
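You can also check the approximation numerically. Here is a small sketch (my own, not from the paper) that compares the exact gradient $\frac{1}{T}(q_i - p_i)$ against the first-order approximation for some arbitrary example logits and a temperature that is large relative to them:

```python
import math

def softmax(x, T):
    """Softmax of logits x at temperature T."""
    exps = [math.exp(v / T) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

z = [1.0, -0.5, 0.3]   # distilled-model logits (arbitrary example values)
v = [1.2, -0.4, 0.1]   # cumbersome-model logits (arbitrary example values)
T = 20.0               # temperature large compared with the logit magnitudes
N = len(z)

q = softmax(z, T)      # soft predictions of the distilled model
p = softmax(v, T)      # soft targets from the cumbersome model

# Exact gradient: (1/T) * (q_i - p_i)
exact = [(qi - pi) / T for qi, pi in zip(q, p)]

# First-order approximation: replace e^{x/T} by 1 + x/T everywhere
approx = [((1 + zi / T) / (N + sum(z) / T)
           - (1 + vi / T) / (N + sum(v) / T)) / T
          for zi, vi in zip(z, v)]

for e, a in zip(exact, approx):
    print(f"exact={e:.6e}  approx={a:.6e}")
```

The two columns agree closely because the error in each $e^{x/T} \approx 1 + x/T$ term is $O((x/T)^2)$, which is tiny when $T \gg |x|$. If you shrink $T$ toward $1$, the two columns drift apart.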