Derivative of a loss function with respect to learning rate.


I do not completely understand why the learning rate is not a trainable parameter in itself. Say we want to calculate the gradient of the loss with respect to the learning rate of our last update, and we assume that we can update it approximately like so: $$\alpha^{T+1} = \alpha^T - \frac{\partial L^{T+1}}{\partial \alpha^T}$$ where $\alpha$ denotes the learning rate and $L$ the loss function. Then, by the chain rule, $$\frac{\partial L^{T+1}}{\partial \alpha^{T}} = \frac{\partial L^{T+1}}{\partial y^{T+1}}\frac{\partial y^{T+1}}{\partial\alpha^{T}} =\frac{\partial L^{T+1}}{\partial y^{T+1}}\frac{\partial y^{T+1}}{\partial\Theta^{T+1}}\frac{\partial\Theta^{T+1}}{\partial\alpha^T}.$$ We know $\frac{\partial L^{T+1}}{\partial y^{T+1}}$: it is simply the derivative of the loss function with respect to the output $y$ of the network. Assuming we calculate $y^{T+1}$ first, before even trying to calculate $\alpha^{T+1}$, we also know $\frac{\partial y^{T+1}}{\partial\Theta^{T+1}}$ from the backpropagation step. If we use plain gradient descent on our network parameters $\Theta$, then $$\Theta^{T+1} = \Theta^T - \alpha^{T}\frac{\partial L^{T}}{\partial\Theta^{T}},$$ and therefore $$\frac{\partial\Theta^{T+1}}{\partial\alpha^T} = -\frac{\partial L^{T}}{\partial\Theta^T}.$$
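To convince myself of that last identity, here is a minimal finite-difference check on a one-parameter toy loss (all names here are my own toy choices, nothing standard):

```python
# Toy check that d(theta_next)/d(alpha) = -dL/dtheta, using
# L(theta) = 0.5 * theta**2, so dL/dtheta = theta.
theta = 2.0
grad = theta                       # dL^T / dtheta^T at the current point

def sgd_step(theta, alpha):
    """One plain gradient-descent update: theta - alpha * dL/dtheta."""
    return theta - alpha * theta

alpha, eps = 0.1, 1e-6
# Central finite difference of theta^{T+1} with respect to alpha.
fd = (sgd_step(theta, alpha + eps) - sgd_step(theta, alpha - eps)) / (2 * eps)
print(fd, -grad)                   # fd should be approximately -grad = -2.0
```

The derivative comes out exactly as $-\frac{\partial L^T}{\partial\theta^T}$, since $\Theta^{T+1}$ is linear in $\alpha^T$.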

Therefore $$\frac{\partial L^{T+1}}{\partial\alpha^T} = -\frac{\partial L^{T+1}}{\partial y^{T+1}}\frac{\partial y^{T+1}}{\partial\Theta^{T+1}}\frac{\partial L^{T}}{\partial\Theta^T}.$$ Assuming we are already in step $T+1$, all of these quantities are known. Couldn't we calculate this with a bit of "delay", assuming this doesn't mess everything up for us, essentially "dragging" the learning rate one step behind, in the hope that the loss surface is smooth enough? I have not tried it myself yet. Can anyone point me toward work that tries to solve this directly, instead of using second-order derivatives to calculate a momentum term?
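For concreteness, the whole scheme I have in mind could be sketched like this in NumPy on a toy quadratic (the hyper-learning-rate `beta` is my own addition for stability; the update formula above corresponds to `beta = 1`):

```python
import numpy as np

# Toy problem: minimize L(theta) = 0.5 * ||theta - target||^2.
# The "network output" is theta itself here, so the product
# (dL/dy)(dy/dTheta) collapses to dL/dTheta = theta - target.
target = np.array([3.0, -2.0])

def loss(theta):
    return 0.5 * np.sum((theta - target) ** 2)

def grad(theta):
    return theta - target

theta = np.zeros(2)
alpha = 0.01    # learning rate, itself updated every step
beta = 0.001    # hyper-learning-rate for alpha (my assumption; the
                # question's update corresponds to beta = 1)

g_prev = grad(theta)                  # dL^T / dTheta^T, lags one step behind
for step in range(100):
    theta = theta - alpha * g_prev    # ordinary parameter update
    g_curr = grad(theta)              # dL^{T+1} / dTheta^{T+1}
    # Hypergradient: dL^{T+1}/dalpha^T = -(g_curr . g_prev), so gradient
    # descent on alpha increases it when successive gradients point the
    # same way and shrinks it when they oppose each other.
    alpha = alpha + beta * np.dot(g_curr, g_prev)
    g_prev = g_curr

print(loss(theta), alpha)   # loss near zero, alpha has grown
```

Note the delay is exactly the one described above: the $\alpha$ update always uses the gradient from the previous parameter step.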