cross entropy for binary or multiclass classification


I'm building an NN classifier to predict whether a sample is of class 1 or 0. I'm trying 3 different network configurations:

  1. One unit in the output layer with sigmoid activation function

  2. Two units in the output layer with sigmoid activation function

  3. Two units in the output layer with softmax activation function
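For reference, the three output layers can be sketched with plain numpy (the logit values here are made up, just to show the activations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift logits for numerical stability
    return e / e.sum()

z1 = np.array([2.0])        # config 1: one logit
z2 = np.array([2.0, -1.0])  # configs 2 and 3: two logits

out1 = sigmoid(z1)  # one sigmoid unit: a single probability for class 1
out2 = sigmoid(z2)  # two sigmoid units: independent values, need not sum to 1
out3 = softmax(z2)  # two softmax units: a distribution over the 2 classes, sums to 1

print(out1, out2, out2.sum(), out3, out3.sum())
```

Note that `out2` does not sum to 1, while `out3` does; this difference matters below.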

Now, I'm confused about how I should compute the cross-entropy loss in each of these three cases. I found two formulas: one for binary classification (1 unit in the output layer), the other for multiclass classification:

  1. $$loss1 = \frac{-1}{m} \left( \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)}\log(\hat y^{(i)}_k) + (1-y_k^{(i)})\log(1-\hat y_k^{(i)})\right)$$

  2. $$loss2 = \frac{-1}{m} \left(\sum_{i=1}^m\sum_{k=1}^K y_k^{(i)}\log(\hat y^{(i)}_k) \right)$$
    with m being the number of samples in the training set and K the number of output units in the last layer.

(source: Wikipedia and ml-cheatsheet)
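Both formulas can be written as a small numpy sketch (the names `loss1`/`loss2` are mine, and `y`/`y_hat` are assumed to be `(m, K)` arrays of labels and predictions):

```python
import numpy as np

def loss1(y, y_hat):
    """Binary (Bernoulli) cross entropy: every unit contributes a term
    for its target being 1 AND a term for its target being 0,
    summed over units and averaged over the m samples."""
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

def loss2(y, y_hat):
    """Categorical cross entropy: only the true class's predicted
    probability contributes, averaged over the m samples."""
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat)) / m
```
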

For network 1, I understand that I should use formula 1 --> binary classification.
For network 3, I should use formula 2 --> multiclass classification.
But where I am confused is network 2: I thought I should use formula 2 as well, but the loss I get with it seems wrong, whereas formula 1 gives a loss that seems right.

For instance, let's say I have one sample, and my prediction ($\hat y$) and true label ($y$) matrices for this sample are:

$\hat y=\begin{bmatrix}0.991 & 0.998\end{bmatrix}$ and $y=\begin{bmatrix}1 & 0\end{bmatrix}$

In this case, formula 2 gives a rather low loss, whereas it should be high. Formula 1 gives a high loss, and hence seems more correct.
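A quick numpy check of that sample confirms the gap between the two formulas:

```python
import numpy as np

y_hat = np.array([[0.991, 0.998]])
y = np.array([[1.0, 0.0]])

# formula 2: only the true class contributes, so the confident-but-wrong
# second unit is ignored entirely: -log(0.991) is tiny
l2 = -np.sum(y * np.log(y_hat))

# formula 1: the second unit's "target is 0" term -log(1 - 0.998)
# dominates, so the loss is large
l1 = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(l1, l2)  # ~6.22 vs ~0.009
```
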

So I don't understand why formula 2 doesn't seem to work for network configuration 2.

I feel it has something to do with the fact that, unlike softmax, the two sigmoid outputs don't form a probability distribution (they don't sum to 1), but I have the feeling I'm doing something wrong here.

Thanks for helping!