TL;DR: What happens to the Gibbs distribution when $\beta \to \infty $, and why? $$ \lim_{\beta \to \infty} \frac{\exp(-\beta E(W, Y^i, X^i))}{\int_y \exp(-\beta E(W, y, X^i)) } \ = \ ? $$
Full question
I asked this on stats.stackexchange before but got no answer, so I'll try again here, since it is more of a math question anyway, albeit one applied to machine learning.
In "A tutorial on energy-based learning" (LeCun et al., 2006), on page 15, section 2.2.4 about the Negative Log-Likelihood Loss, is written: "Interestingly, the NLL loss reduces to the generalized perceptron loss when $\beta \to \infty $ (zero temperature),...".
This implies:
$$ \lim_{\beta \to \infty} \Big( \frac{1}{\beta} \, \log \int_y \exp (- \beta E(y)) \Big) \ = \ - \min_{y \in \mathcal{Y}} E(y) $$
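(As a quick sanity check before my questions: numerically this limit does seem to hold on a toy 1-D discretization of the integral. The example energy and the grid in the sketch below are arbitrary choices of mine, not from the paper.)

```python
# Sanity check of (1/beta) * log( integral of exp(-beta E(y)) dy ) -> -min E.
# The integral over y is approximated by a Riemann sum on a 1-D grid.
import numpy as np
from scipy.special import logsumexp

y = np.linspace(-3.0, 3.0, 2001)          # grid standing in for the label space Y
dy = y[1] - y[0]
E = (y - 0.7)**2 + 0.3 * np.sin(5 * y)    # arbitrary toy energy with one global minimum

for beta in [1, 10, 100, 1000, 10000]:
    # (1/beta) * log( sum_i exp(-beta * E_i) * dy ), computed stably via logsumexp
    lhs = (logsumexp(-beta * E) + np.log(dy)) / beta
    print(f"beta={beta:>6}: (1/beta) log Z = {lhs: .4f}   vs   -min E = {-E.min():.4f}")
```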
In the formulas above, $E(y)$ is short for the energy $E(W, y, X^i)$; I drop $W$ and $X^i$ for readability. I'm having trouble understanding exactly why this limit holds. I've tried two approaches. The first is applying l'Hôpital's rule for $\frac{\infty}{\infty}$ to the LHS, which (if I didn't make mistakes) gives:
$$ \lim_{\beta \to \infty} \Big( \frac{1}{\beta} \, \log \int_y \exp (- \beta E(y)) \Big) \ = \ - \lim_{\beta \to \infty} \int_y E(y) \, \frac{\exp(- \beta E(y))}{ \int_{y'} \exp(- \beta E(y')) } $$
The fraction inside the integral looks like a probability (a Gibbs distribution), so the whole integral seems to be some kind of mean value of $E$ over $\mathcal{Y}$, only taken as $\beta \to \infty$, which I don't know how to interpret. Applying l'Hôpital's rule a second time is not allowed (I think?) because the limit of the derivative of the denominator is $0$; in any case it doesn't get me any further, and neither does substituting $\alpha$ for $\exp(-\beta)$. My second approach was inspired by the generalized mean (wiki), which approaches the maximum as $p \to \infty$:
$$ \lim_{\beta \to \infty} \Big( \frac{1}{\beta} \, \log \int_y \exp (- \beta E(y)) \Big) \\ = \log \, \lim_{\beta \to \infty} \Big( \int_y \exp(-E(y)) ^ \beta \Big)^{1/\beta} \\ = \log \, \lim_{\beta \to \infty} \, \exp(- \min_y E(y)) \Big( \int_y \Big( \frac{\exp(-E(y))}{\exp(- \min_y E(y))} \Big)^{\beta} \Big)^{1/\beta} $$ If the last integral were a discrete sum over all $ y \in \mathcal{Y}$, all terms would be $<1$ and hence go to $0$ as $\beta \to \infty$, except the term where $E(y) = \min_y E(y)$, which equals $1$; the expression would then become:
$$ \log \, \exp( -\min_y E(y)) \, \lim_{\beta \to \infty} (...) \\ = \log \, \exp( -\min_y E(y)) \\ = - \min_{y \in \mathcal{Y}} E(y) $$
This is what we were looking for. However, I don't know whether pulling $ \exp( -\min_y E(y)) $ out of the integral, as I would with a sum, is correct, nor whether the limit of the remaining factor then equals $1$, as it would for a sum. Could someone explain this?
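At least numerically, the remaining factor does appear to go to $1$; here is the check I ran on the same kind of toy grid (again just a sketch with an arbitrary example energy of mine):

```python
# Checking the factored form numerically: with r(y) = exp(-(E(y) - min E)) <= 1,
# does the remaining factor ( integral of r(y)^beta dy )^(1/beta) tend to 1?
import numpy as np

y = np.linspace(-3.0, 3.0, 2001)
dy = y[1] - y[0]
E = (y - 0.7)**2 + 0.3 * np.sin(5 * y)    # same arbitrary toy energy as above
r = np.exp(-(E - E.min()))                # ratio in (0, 1], equal to 1 at the argmin

for beta in [1, 10, 100, 1000, 10000]:
    log_factor = np.log(np.sum(r**beta) * dy) / beta   # log of (integral of r^beta dy)^(1/beta)
    print(f"beta={beta:>6}: (integral of r^beta dy)^(1/beta) = {np.exp(log_factor):.5f}")
```

This matches the intuition from the discrete sum, but a numerical check on one example is of course not a proof.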
Was applying l'Hôpital's rule correct? Could we have reached the result along that route as well? And could someone shed some light on what happens to the Gibbs distribution when $\beta \to \infty$, or, as in physics, when $T \to 0$? I can't find a clear and concise explanation on the internet.
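For what it's worth, on the same kind of toy discretization the Gibbs weights appear to pile up on the minimizer of $E$ as $\beta$ grows, which would make the "mean value of $E$" above tend to $\min_y E(y)$, but I would like to understand this properly (again, the energy below is just an arbitrary example of mine):

```python
# What the Gibbs distribution does as beta grows: normalize exp(-beta E) on a grid
# and look at the Gibbs-weighted mean of E and where the probability mass sits.
import numpy as np

y = np.linspace(-3.0, 3.0, 2001)
E = (y - 0.7)**2 + 0.3 * np.sin(5 * y)     # arbitrary toy energy

for beta in [1, 10, 100, 1000]:
    w = np.exp(-beta * (E - E.min()))      # shift by min E for numerical stability
    p = w / w.sum()                        # Gibbs probabilities on the grid
    mean_E = float(p @ E)                  # Gibbs-weighted mean of the energy
    mass_near_min = float(p[np.abs(y - y[E.argmin()]) < 0.1].sum())
    print(f"beta={beta:>5}: <E>_Gibbs = {mean_E:.4f}, mass within 0.1 of argmin = {mass_near_min:.3f}")
print(f"min E = {E.min():.4f}")
```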
Many thanks in advance :).