Clarification: binary cross entropy derivation


I'm learning about cross-entropy in the context of machine learning and I've stumbled across a notation problem I'm not sure about.

As I understand it, when training a machine learning model using MLE, the goal is to minimize the dissimilarity between the empirical distribution $\hat{p}_{data}$ defined by the training set and the model distribution $p_{model}$. Thus a cost function based on the cross-entropy between these distributions may (after some derivation) be defined as follows:

$J(\theta) = - \mathop{\mathbb{E}}_{(x,y)\sim\hat{p}_{data}}[\log p_{model}(y | x)]$

where $\theta$ are the parameters of the model.

When dealing with a binary classification problem, the model distribution $p_{model}$ may be defined as a Bernoulli distribution over $y$ conditioned on $x$. Furthermore, we assume that the model output lies within the $(0,1)$ interval (e.g. by using a sigmoid as the last activation).

Now, I would like to derive the cost function of such a model, and I am unsure whether the following formulation is mathematically correct:

$J(\theta) = - \mathop{\mathbb{E}}_{(x,y)\sim\hat{p}_{data}}[y\log\hat{y} + (1 - y)\log(1 - \hat{y})]$

where $\hat{y} = f(x; \theta)$ is the output of the model.
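For completeness, here is my reasoning for this form (assuming the Bernoulli parametrization above): writing the Bernoulli pmf with parameter $\hat{y}$ and taking the logarithm gives exactly the bracketed term,

$$p_{model}(y \mid x) = \hat{y}^{\,y}(1 - \hat{y})^{1 - y}, \qquad
\log p_{model}(y \mid x) = y\log\hat{y} + (1 - y)\log(1 - \hat{y}),$$

which, substituted into $J(\theta)$ above, yields the binary cross-entropy expression.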

My understanding is that the expectation with respect to the empirical distribution $\hat{p}_{data}$ is simply the mean of $y\log\hat{y} + (1 - y)\log(1 - \hat{y})$ over the entire training set.
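As a sanity check of that reading, here is a minimal sketch (the labels and model outputs are made-up toy values, not from any real model) that computes the expectation as a plain mean over the training set:

```python
import numpy as np

# Toy training set: binary labels y and model outputs y_hat = f(x; theta).
# The outputs are assumed to come from a sigmoid, so they lie in (0, 1).
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Expectation over the empirical distribution = mean over the training set.
bce = -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
print(bce)
```

This matches averaging the per-example negative log-likelihoods $-\log p_{model}(y_i \mid x_i)$ one by one.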

My question, then, is whether the above formulation of the cost function is correct.