What softmax temperature should I use during neural network training?


I've written a GRU (gated recurrent unit) implementation in C#, and it works fine. But my softmax layer has no temperature parameter (T = 1). I want to implement "softmax with temperature": $$ P_{i} = \frac{e^{y_{i}/T}}{\sum_{k=1}^{n}e^{y_{k}/T}} $$ but I can't find an answer to my question: should I train my neural network with T = 1 (my current default), or should I use some specific value related to the temperature I intend to use during sampling?
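For reference, here is a minimal C# sketch of the formula above (the class and method names are illustrative, not from my actual code):

```csharp
using System;
using System.Linq;

static class SoftmaxSketch
{
    // Softmax with temperature, matching the formula above.
    public static double[] SoftmaxWithTemperature(double[] logits, double temperature)
    {
        // Subtracting the max logit before exponentiating avoids overflow;
        // it does not change the resulting probabilities.
        double max = logits.Max();
        double[] exps = logits.Select(y => Math.Exp((y - max) / temperature)).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }
}
```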


There are 2 answers below.


Adding a temperature to the softmax changes the probability distribution: it becomes softer (closer to uniform) when $T > 1$. However, I suspect that SGD will simply learn to compensate for this rescaling effect by adjusting the scale of the logits.
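To see the softening concretely, here is a quick sketch with made-up logits, reusing the hypothetical `SoftmaxSketch` helper from the question:

```csharp
// Made-up logits to show the effect; the values are illustrative only.
double[] logits = { 2.0, 1.0, 0.1 };

double[] p1 = SoftmaxSketch.SoftmaxWithTemperature(logits, 1.0);
// p1 ≈ [0.66, 0.24, 0.10] (peaked on the largest logit)

double[] p2 = SoftmaxSketch.SoftmaxWithTemperature(logits, 2.0);
// p2 ≈ [0.50, 0.30, 0.19] (softer, closer to uniform)
```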


Even with T = 1 you have an implicit temperature, due to your choice of unit in measuring y, or perhaps even a scaling parameter in generating y; you could normalize by dividing by the standard deviation. The choice of temperature during training may depend on training set size. In any case, you should validate your training against an independent data set, and you may use that set to tune the temperature choice (and, philosophically, re-validate the temperature choice on yet another set of data).
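A minimal sketch of the standard-deviation normalization mentioned above, in C# to match the question (the helper name is hypothetical):

```csharp
using System;
using System.Linq;

static class LogitNormalization
{
    // Divide the logits by their standard deviation so that T = 1
    // corresponds to unit-variance inputs to the softmax.
    public static double[] NormalizeByStd(double[] logits)
    {
        double mean = logits.Average();
        double variance = logits.Select(y => (y - mean) * (y - mean)).Average();
        double std = Math.Sqrt(variance);
        // Guard against a degenerate all-equal logit vector.
        return std > 0 ? logits.Select(y => y / std).ToArray() : logits;
    }
}
```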