When can the sigmoid-activated output of a neural network be interpreted as a probability?


I've created a neural network whose final layer outputs a single sigmoid-activated value. I've trained the network on binary-labeled data (i.e. data labeled either 0 for the negative class or 1 for the positive class). Normally, when predicting the class of some unlabeled sample, I would assume it belongs to the positive class if the network's output for that sample is above some cutoff (say, 0.5), and to the negative class otherwise. However, I want to know whether I can correctly interpret the output of such a network as the probability that a given sample belongs to the positive class.

Since the sigmoid function -- specifically, in this case, the logistic function $$\frac{1}{1 + e^{-x}}$$ -- has range $(0, 1)$, it seems reasonable to interpret its outputs as probabilities, and I've seen a few sources that lead me to think that this is in fact a valid interpretation (e.g. this post), although I'm unsure why, mathematically, this would be the case and under what conditions it would hold.
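As a quick sanity check on the claim about the range, here is a minimal sketch of the logistic function and the thresholding rule described above (the cutoff of 0.5 and the test scores are illustrative, not from any trained network):

```python
import math

def logistic(x):
    """The logistic function 1 / (1 + e^(-x)); maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_class(score, cutoff=0.5):
    """Threshold the sigmoid output: 1 = positive class, 0 = negative class."""
    return 1 if logistic(score) > cutoff else 0

print(logistic(0.0))    # exactly 0.5: the decision boundary in score space
print(logistic(10.0))   # close to 1
print(logistic(-10.0))  # close to 0
print(predict_class(3.0), predict_class(-3.0))
```

Note that the output is never exactly 0 or 1, which is one reason the open interval $(0, 1)$ matters: a probability interpretation would never assign full certainty to either class.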

Best Answer

Logistic regression does not give you a true probability; rather, its output is a measure of the model's confidence (not in the sense of a statistical confidence interval). To understand the difference between confidence and probability, imagine you have trained a logistic regression classifier to predict whether an image shows a snowy or a dry road. Assume we use the mean pixel value as the feature and that it discriminates well between snowy and dry roads. Now we take a new image, which shows a wooden floor. We compute its mean pixel value and feed it into the logistic regression. Since this value will lie somewhere between those of snowy and dry roads, the logistic regression will output a value of magnitude $\approx 0.5$. If we interpreted this result as a probability, it would mean the model thinks the chances are $50$-$50$ for dry road vs. snowy road. But we know this is nonsense: the image shows neither. This is no surprise, because logistic regression only gives us the model's confidence, not a probability. This is why logistic regression is called a discriminative model.
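The wooden-floor scenario can be reproduced with a small sketch: a logistic regression fit by plain gradient descent on synthetic 1D "mean pixel value" data (all values here are made up for illustration). An in-between feature value gets a confidently-looking $\approx 0.5$ output even though it comes from neither class:

```python
import numpy as np

# Synthetic mean pixel values (illustrative): dry roads dark, snowy roads bright.
rng = np.random.default_rng(0)
x_dry = rng.normal(60.0, 5.0, 100)    # class 0: dry road
x_snow = rng.normal(200.0, 5.0, 100)  # class 1: snowy road
X = np.concatenate([x_dry, x_snow])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Standardize the feature so plain gradient descent behaves well.
mu, sigma = X.mean(), X.std()
Xs = (X - mu) / sigma

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit weight w and bias b by gradient descent on the cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(5000):
    p = sigmoid(w * Xs + b)
    w -= 1.0 * np.mean((p - y) * Xs)
    b -= 1.0 * np.mean(p - y)

def predict(x_raw):
    """Sigmoid output for a raw (unstandardized) mean pixel value."""
    return sigmoid(w * (x_raw - mu) / sigma + b)

print(predict(60.0))   # typical dry road: output near 0
print(predict(200.0))  # typical snowy road: output near 1
print(predict(130.0))  # "wooden floor" value between the classes: ~0.5
```

The $\approx 0.5$ output for the in-between value only says the point sits near the decision boundary; it carries no information about whether the point resembles the training data at all.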

In contrast to discriminative models, generative probabilistic models try to model the distribution (often a normal distribution) of the mean pixel value for each class: $\mathcal{C}_1$ for dry roads and $\mathcal{C}_2$ for snowy roads. With such a procedure we obtain two distributions over the mean pixel value $x$, as the following figure illustrates. If the distributions are well separated (little overlap), a new value of $x$ that falls between them will have low probability density under both classes (e.g. imagine the mean pixel value of the wooden floor lying at the intersection of the two distributions).

[Figure: the two class-conditional distributions of the mean pixel value $x$, one for $\mathcal{C}_1$ (dry road) and one for $\mathcal{C}_2$ (snowy road), with a small overlap region between them]
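The generative alternative described above can be sketched as follows. The Gaussian parameters are illustrative (they match the synthetic scenario, not real image data); the point is that an out-of-distribution value gets a near-zero density under *both* classes, which a discriminative model cannot report:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Illustrative class-conditional models of the mean pixel value:
# C1 = dry road, C2 = snowy road.
mean_dry, std_dry = 60.0, 5.0
mean_snow, std_snow = 200.0, 5.0

def class_likelihoods(x):
    """Density of x under each class model: (p(x | C1), p(x | C2))."""
    return gaussian_pdf(x, mean_dry, std_dry), gaussian_pdf(x, mean_snow, std_snow)

# A typical dry-road value has high density under C1:
print(class_likelihoods(60.0))
# The "wooden floor" value between the classes has near-zero density under
# BOTH classes, so the generative model can flag it as unlike the training
# data -- unlike the logistic regression's uninformative 0.5 output.
print(class_likelihoods(130.0))
```

This is the practical payoff of modeling $p(x \mid \mathcal{C}_k)$ rather than only the decision boundary: the densities themselves reveal when an input resembles neither class.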