Why do we use softmax with log-likelihood in deep learning?


As is often noted, the pairing is convenient when we run back-propagation, because of the following formula: $$\nabla_x \ell_\text{LL} (S(x),\, e_k) = S(x) - e_k, \quad x \in \mathbb R^d,$$ where $\ell_\text{LL}(z, e_k) := -\ln z_k$ is the log-likelihood loss, $S: \mathbb R^d \to \mathbb R^d$ is the softmax, and $e_k$ is the one-hot vector with $1$ in the $k$-th entry.
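As a sanity check, the closed-form gradient $S(x) - e_k$ can be compared against central finite differences of the loss; the sketch below does this in NumPy (function names and the dimension $d = 5$ are my own choices for illustration):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability before exponentiating
    z = np.exp(x - x.max())
    return z / z.sum()

def nll(x, k):
    # Negative log-likelihood of class k under softmax(x)
    return -np.log(softmax(x)[k])

rng = np.random.default_rng(0)
d, k = 5, 2
x = rng.normal(size=d)
e_k = np.eye(d)[k]

# Closed-form gradient: S(x) - e_k
analytic = softmax(x) - e_k

# Central finite-difference approximation of the gradient
eps = 1e-6
numeric = np.array([
    (nll(x + eps * np.eye(d)[i], k) - nll(x - eps * np.eye(d)[i], k)) / (2 * eps)
    for i in range(d)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

The two gradients agree to within finite-difference error, confirming the formula above.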

The same phenomenon occurs when we pair the (element-wise) sigmoid $\sigma$ with the cross-entropy loss $\ell_\text{CE}$: $$\nabla_x \ell_\text{CE} (\sigma(x),\,e_k) = \sigma(x) - e_k, \quad x \in \mathbb R^d.$$
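The sigmoid case admits the same kind of check. Here $\ell_\text{CE}$ is the binary cross-entropy summed over components, $\ell_\text{CE}(z, y) = -\sum_i \big(y_i \ln z_i + (1-y_i)\ln(1-z_i)\big)$; the sketch below (dimension and seed are arbitrary) verifies that its gradient in $x$ is $\sigma(x) - e_k$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(x, y):
    # Binary cross-entropy between sigmoid(x) and target y, summed over components
    s = sigmoid(x)
    return -(y * np.log(s) + (1 - y) * np.log(1 - s)).sum()

rng = np.random.default_rng(1)
d, k = 4, 1
x = rng.normal(size=d)
e_k = np.eye(d)[k]

# Closed-form gradient: sigma(x) - e_k
analytic = sigmoid(x) - e_k

# Central finite-difference approximation
eps = 1e-6
numeric = np.array([
    (bce(x + eps * np.eye(d)[i], e_k) - bce(x - eps * np.eye(d)[i], e_k)) / (2 * eps)
    for i in range(d)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```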

Another simple example arises if we use the Euclidean $\ell_2$-loss with no activation function at the output: $$\nabla_x \frac {\|x - e_k\|^2} 2 = x - e_k, \quad x \in \mathbb R^d.$$

Okay, so I understand why it is good for backprop to pair the softmax with the log-likelihood. But my question is: is this the only reason we do it?

Is there any other reason?


1 Answer

Accepted answer:

This is explained in many places. The short version is that minimizing the cross-entropy loss with a softmax output layer is equivalent to maximizing the likelihood of the model, which is a natural criterion for choosing a good model. See, e.g., https://stackoverflow.com/q/17187507/781723, https://stats.stackexchange.com/q/233658/2921.
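The equivalence is easy to see numerically: the total cross-entropy over a dataset is the negative log of the likelihood, so $\exp(-\text{CE})$ recovers the likelihood exactly. A minimal sketch (the logits and labels are made up for illustration):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(2)
logits = rng.normal(size=(3, 5))   # three samples, five classes
labels = [0, 3, 1]                 # true class of each sample

# Likelihood of the labels: product of predicted probabilities
likelihood = np.prod([softmax(l)[k] for l, k in zip(logits, labels)])

# Total cross-entropy loss: sum of negative log probabilities
total_ce = sum(-np.log(softmax(l)[k]) for l, k in zip(logits, labels))

# exp(-CE) equals the likelihood, so minimizing CE maximizes likelihood
print(np.isclose(np.exp(-total_ce), likelihood))  # True
```

Because $\exp$ is monotone, the minimizer of the cross-entropy and the maximizer of the likelihood coincide.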