If logistic is the log odds ratio, what's softmax?


I recently saw a nice explanation of logistic regression: with logistic regression, we want to model the probability of success, however you define that in the context of the problem. Probabilities are between 0 and 1, so we can't do a linear regression directly, but we can if we rewrite the probability in an equivalent form whose range spans the entire real line. The odds, $\frac{P}{1-P}$, span from 0 to infinity, and to get the rest of the way, the natural log of the odds spans from -infinity to infinity. Then we do a linear regression on that quantity, $\beta X = \log{\frac{P}{1-P}}$. When solving for the probability, we naturally end up with the logistic function, $P = \frac{e^{\beta X}}{1 + e^{\beta X}}$.
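This round trip (probability to log odds and back) is easy to check numerically; here is a small sketch (the function names are mine, chosen for illustration):

```python
import math

def log_odds(p):
    """Map a probability in (0, 1) to the whole real line via log(P / (1-P))."""
    return math.log(p / (1 - p))

def logistic(z):
    """Invert the log odds: map any real z back to a probability."""
    return math.exp(z) / (1 + math.exp(z))

# Round-tripping recovers the original probability.
for p in (0.1, 0.5, 0.9):
    assert abs(logistic(log_odds(p)) - p) < 1e-12
```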

That explanation felt really intuitive to me, and it nicely explains why the output of the logistic function is interpreted as a probability. The softmax function, $\frac{e^{x_i}}{\sum_k{e^{x_k}}}$, is supposed to generalize the logistic function to multiple classes instead of just two (success or failure).

Is there a similarly intuitive explanation for why the output of the softmax is a probability, and for how it generalizes the logistic function? I've seen various derivations, but they don't have the same ring to them that the log odds explanation does.


BEST ANSWER

I will separate my answer based on your 2 questions:

  1. How does softmax generalize the logistic function?

As you've stated correctly, logistic regression models the probability of success. The problem is that in multiclass classification you do not have a single notion of success; what you would like instead is to encode the probability of being in each class (softmax). To show that softmax is a generalization, we simply need to realize that the probability of success could equally be encoded by two quantities: the probability of being in class success and the probability of being in class failure. Here's a very loose proof of the equivalence of softmax with $K=2$ and logistic regression:

$$ \begin{align} \Pr(y_i=1) &= \frac{e^{\theta_1^T x_i}}{\sum_{c \in \{0,1\}}{e^{\theta_c^T x_i}}} \\ &= \frac{e^{\theta_1^T x_i}}{e^{\theta_0^T x_i} + e^{\theta_1^T x_i}} \\ &= \frac{1}{e^{(\theta_0-\theta_1)^T x_i} + 1} \end{align} $$

Now simply define $\theta = -(\theta_0-\theta_1)$ and you have logistic regression :).
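A quick numeric sanity check of this equivalence (the two scores stand in for $\theta_0^T x_i$ and $\theta_1^T x_i$; the values are made up):

```python
import math

def softmax(zs):
    """Exponentiate each score and normalize by the total."""
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Two arbitrary linear-regression scores for classes 0 and 1.
z0, z1 = 0.3, 1.7
p1_softmax = softmax([z0, z1])[1]

# Logistic regression with theta = -(theta_0 - theta_1), i.e. sigma(z1 - z0).
p1_logistic = 1 / (1 + math.exp(-(z1 - z0)))

# The two-class softmax and the logistic function agree.
assert abs(p1_softmax - p1_logistic) < 1e-12
```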

  2. What's the intuition behind softmax?

As with logistic regression, there is a simple intuitive explanation. I will approach it from the other direction (from linear regression to softmax), as I find that more intuitive. The output of a linear regression lies in $]-\infty,\infty[$, but we need it to be in $[0,1]$ since we are trying to model the probability of being in a certain class. This can be done by first taking the exponential of the linear regression: $e^{\theta_{c'}^T x_i}: \ ]-\infty,\infty[ \ \rightarrow \ ]0,\infty[$. This assigns a positive importance weight to each class; to get a probability we simply normalize by the sum of the weights over all classes: $\frac{e^{\theta_{c'}^T x_i}}{\sum_{c}{e^{\theta_c^T x_i}}}: \ ]-\infty,\infty[ \ \rightarrow \ ]0,1]$. And there you have your probability!
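The two steps (exponentiate, then normalize) can be sketched directly; the scores below are arbitrary made-up values standing in for linear-regression outputs:

```python
import math

# Scores from a linear model can be any real numbers.
scores = [-3.0, 0.0, 2.5]

# Step 1: exp maps ]-inf, inf[ to ]0, inf[ -- positive "importance weights".
weights = [math.exp(s) for s in scores]
assert all(w > 0 for w in weights)

# Step 2: normalizing by the total turns the weights into probabilities.
probs = [w / sum(weights) for w in weights]
assert abs(sum(probs) - 1.0) < 1e-12
assert all(0 < p <= 1 for p in probs)
```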

Hope that helps :)

EDIT: You are asking specifically about what you are regressing. I didn't answer this at the beginning because there isn't an explanation as clear as the one for logistic regression (I think of softmax as a map from a linear regression to probabilities). If you really want to understand what you're regressing, you can simply derive it (I'll use $p_c := \Pr(y_i=c)$ for simplicity):

$$ \begin{align} p_c &= \frac{e^{\theta_c^T x_i}}{e^{\theta_c^T x_i} + \sum_{c' \neq c}{e^{\theta_{c'}^T x_i}}} \\ (1-p_c)\,e^{\theta_c^T x_i} &= p_c \sum_{c' \neq c}{e^{\theta_{c'}^T x_i}} \\ \theta_c^T x_i &= \log\left(\frac{p_c}{1-p_c}\sum_{c' \neq c}{e^{\theta_{c'}^T x_i}}\right) \end{align} $$

Using the fact that $\log\left(\sum_i{e^{x_i}}\right) \approx \max_i(x_i)$:

$$ \begin{align} \theta_c^T x_i &\approx \log\left(\frac{p_c}{1-p_c}\right) + \max_{c' \neq c}(\theta_{c'}^T x_i) \\ \theta_c^T x_i - \max_{c' \neq c}(\theta_{c'}^T x_i) &\approx \log\left(\frac{p_c}{1-p_c}\right) \end{align} $$

We thus see that the log odds of a single class approximately encode the difference between that class's linear regression output and the largest regression output among all the other classes.
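The log-sum-exp approximation used above is tight whenever one score dominates the others; a quick check with made-up values:

```python
import math

def logsumexp(xs):
    """log of the sum of exponentials; always >= max(xs)."""
    return math.log(sum(math.exp(x) for x in xs))

# When one score clearly dominates, log-sum-exp is very close to the max.
xs = [10.0, 1.0, 0.5]
assert logsumexp(xs) >= max(xs)
assert abs(logsumexp(xs) - max(xs)) < 1e-3
```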


I think the answer above misses the most natural way that log-odds induces softmax. The qualitative behavior described is correct, but I think the more canonical explanation of where softmax comes from is this: for classes $[K]$, we model the log odds between any pair of classes as a linear function. We use class $K$ as a reference point (since probability distributions are normalized, we only need to determine $K-1$ sets of parameters):

$\log \frac{P(Y=i \mid \mathbf{w}, \mathbf{x})}{P(Y=K \mid \mathbf{w},\mathbf{x})} = \mathbf{w}_i \cdot \mathbf{x}, \quad i = 1, 2, \ldots, K-1$.

We then exponentiate this for $i=1,2,\ldots,K-1$ and, using the normalization constraint, get $P(Y=i \mid \mathbf{w}, \mathbf{x}) = \frac{\exp(\mathbf{w}_i \cdot \mathbf{x})}{1+ \sum_{j=1}^{K-1} \exp(\mathbf{w}_j \cdot \mathbf{x})}$ and $P(Y=K \mid \mathbf{w}, \mathbf{x}) = \frac{1}{1+ \sum_{j=1}^{K-1} \exp(\mathbf{w}_j \cdot \mathbf{x})}$. If we identify $\mathbf{w}_K = \mathbf{0}$, we recover the standard softmax.
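A numeric sketch of this identification: the reference-class parameterization and a plain softmax with a zero score appended for class $K$ give the same distribution (the scores stand in for $\mathbf{w}_i \cdot \mathbf{x}$ and are made up):

```python
import math

def softmax(zs):
    """Standard softmax over a list of scores."""
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Pairwise log-odds model: scores w_i . x for i = 1..K-1, class K held out.
scores = [1.2, -0.4, 0.7]  # w_1.x, w_2.x, w_3.x  (so K = 4)
denom = 1 + sum(math.exp(s) for s in scores)
probs_ref = [math.exp(s) / denom for s in scores] + [1 / denom]

# Identifying w_K = 0 recovers the standard softmax over K scores.
probs_softmax = softmax(scores + [0.0])
assert all(abs(a - b) < 1e-12 for a, b in zip(probs_ref, probs_softmax))
```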

A reference for this is page 119 of ESL (The Elements of Statistical Learning), Second Edition: https://web.stanford.edu/~hastie/Papers/ESLII.pdf

https://en.wikipedia.org/wiki/Multinomial_logistic_regression.

One can also derive softmax directly from a log-linear model (see Wikipedia).