Softmax function and modelling probability distributions


Hinton in his neural network course on Coursera says that "Any probability distribution P over discrete states (P(x) > 0 for all x) can be represented as the output of a softmax unit for some inputs." I was trying to prove it but didn't manage. Do you know of a proof of this?

Accepted answer:

Let us first define a few things:

  • Suppose, without loss of generality, that the random variable whose probability distribution we are considering can take the $n$ different values $1,2,\ldots,n$ with respective probabilities $\{p_k\}_{k=1,\ldots,n}$.
  • Consider a softmax activation unit, which takes a vector $x\in\mathbb{R}^m$ ($m\geq n$) as input and outputs $n$ values $g_k(x)$, $k=1,\ldots,n$, where the $w_k$ are the weights of the node. More specifically, for any $k\in\{1,\ldots,n\}$, we have: \begin{equation} g_k(x)=\frac{e^{w_k^T x}}{\sum_{j=1}^n e^{w_j^T x}} \end{equation}
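As a concrete aid (not part of the proof itself), the softmax map can be sketched in Python; subtracting the maximum logit first is a standard numerical-stability trick and does not change the output, since softmax is invariant to adding a constant to every logit:

```python
import math

def softmax(logits):
    """Return the softmax of a list of logits w_k^T x.

    Shifting by the max is purely for numerical stability;
    the result is mathematically unchanged.
    """
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]
```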

Essentially, what we are trying to do is determine weight vectors $\{w_k\}_{k=1,\ldots,n}$ and one input vector $x$ such that we have: \begin{equation} g_k(x)=p_k,\quad \forall k\in\{1,\ldots,n\} \end{equation}

Take the logarithm of the previous equality to obtain: \begin{equation} \ln p_k=w_k^T x-\ln z,\quad \forall k=1,\ldots,n \end{equation} where $z=\sum_{j=1}^n e^{w_j^T x}$.

To eliminate $z$, we choose one of the possible values as our "pivot". Suppose we choose $n$ as the pivot. Subtracting the equality for $k=n$ from each of the others, we can rewrite them as: \begin{align} \ln \frac{p_1}{p_n}&=(w_1-w_n)^T x\\ \ln \frac{p_2}{p_n}&=(w_2-w_n)^T x\\ \vdots\\ \ln \frac{p_{n-1}}{p_n}&=(w_{n-1}-w_n)^T x \end{align} Since we have extra degrees of freedom, we can arbitrarily set $w_n=0$, which yields: \begin{align} \ln \frac{p_1}{p_n}&=w_1^T x\\ \ln \frac{p_2}{p_n}&=w_2^T x\\ \vdots\\ \ln \frac{p_{n-1}}{p_n}&=w_{n-1}^T x \end{align}
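This step admits a quick numerical sanity check (the example distribution below is my own choice): feeding the logits $\ln(p_k/p_n)$, with the $n$-th logit set to $0$, through a softmax recovers $p$ exactly:

```python
import math

# Hypothetical target distribution (any p_k > 0 summing to 1 works).
p = [0.1, 0.2, 0.3, 0.4]
n = len(p)

# Logits from the derivation: ln(p_k / p_n) for k < n, and 0 for
# k = n (the choice w_n = 0 makes the n-th logit zero).
logits = [math.log(pk / p[-1]) for pk in p[:-1]] + [0.0]

exps = [math.exp(v) for v in logits]
z = sum(exps)
recovered = [e / z for e in exps]
# recovered matches p up to floating-point error
```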

Letting $q=\begin{pmatrix}\ln \frac{p_1}{p_n}\\ \vdots\\ \ln \frac{p_{n-1}}{p_n} \end{pmatrix}$ and $W=\begin{pmatrix} \leftarrow &w_1 & \rightarrow \\ &\vdots & \\\leftarrow &w_{n-1} & \rightarrow \end{pmatrix}$, these equations can be written in matrix form as: \begin{equation} q=Wx \end{equation} Now, choose any set of $n-1$ vectors $w_k$, $k=1,\ldots,n-1$, so that the matrix $W$ has rank $n-1$ (this is possible because $W$ has $m\geq n>n-1$ columns). Without loss of generality, assume that the first $n-1$ columns of $W$ are independent and let $\tilde{W}$ be the $(n-1)\times(n-1)$ matrix composed of those columns. We can then solve for $x$ by letting the first $n-1$ components of $x$ be given by $\tilde{W}^{-1} q$ and setting the remaining components to zero. By construction, we then satisfy $q=Wx$, i.e. $w_k^T x=\ln\frac{p_k}{p_n}$ for $k<n$ and $w_n^T x=0$, so \begin{equation} g_k(x)=\frac{p_k/p_n}{\sum_{j=1}^n p_j/p_n}=p_k \end{equation} for all $k$, as desired.

To sum up, following this procedure we will have constructed a set of weight vectors $\{w_k\}_{k=1,\ldots,n}$, with $w_n=0$, and an input vector $x\in\mathbb{R}^m$ whose first $n-1$ components are given by $\tilde{W}^{-1}q$ (where $\tilde{W}$ is the invertible $(n-1)\times(n-1)$ submatrix of $W$ chosen above) and whose remaining components are zero. These $w_k$ and $x$ are such that the outputs $g_k(x)$ of the softmax unit satisfy $g_k(x)=p_k$ for all $k$.
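The whole construction can be sketched end to end in Python. Here I make the simplest choices (my own, not forced by the proof): $m=n$, $\tilde{W}$ equal to the identity, so $w_k$ is the $k$-th standard basis vector for $k<n$ and $w_n=0$, with an arbitrary example distribution:

```python
import math

# Hypothetical target distribution (any p_k > 0 summing to 1 works).
p = [0.05, 0.15, 0.25, 0.55]
n = len(p)
m = n  # input dimension, m >= n

# Choose W~ = identity: w_k is the k-th standard basis vector of
# R^m for k = 1..n-1, and w_n = 0.
w = [[1.0 if i == k else 0.0 for i in range(m)] for k in range(n - 1)]
w.append([0.0] * m)  # w_n = 0

# q_k = ln(p_k / p_n); with W~ = I, the solution is x = (q, 0).
q = [math.log(p[k] / p[-1]) for k in range(n - 1)]
x = q + [0.0] * (m - (n - 1))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Softmax outputs g_k(x) = exp(w_k^T x) / sum_j exp(w_j^T x).
exps = [math.exp(dot(wk, x)) for wk in w]
z = sum(exps)
g = [e / z for e in exps]
# g matches p up to floating-point error
```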