I'm learning about cross-entropy in the context of machine learning and I've stumbled across a notation problem I'm not sure about.
As I understand it, when training a machine learning model using MLE, the goal is to minimize the dissimilarity between the empirical distribution $\hat{p}_{data}$ defined by the training set and the model distribution $p_{model}$. Thus a cost function based on the cross-entropy between these distributions (after some derivation) may be defined as follows:
$J(\theta) = - \mathop{\mathbb{E}}_{(x,y)\sim\hat{p}_{data}}[\log p_{model}(y | x)]$
where $\theta$ are the parameters of the model.
When dealing with a binary classification problem, the model distribution $p_{model}$ may be defined as a Bernoulli distribution over $y$ conditioned on $x$. Furthermore, we assume that the model output lies within the interval $(0,1)$ (e.g. by using a sigmoid as the last activation).
Now, I would like to derive the cost function of such a model, and I am unsure whether the following formulation is mathematically correct:
$J(\theta) = - \mathop{\mathbb{E}}_{(x,y)\sim\hat{p}_{data}}[y\log\hat{y} + (1 - y)\log(1 - \hat{y})]$
where $\hat{y} = f(x; \theta)$ is the output of the model.
My understanding is that the expectation with respect to the empirical distribution $\hat{p}_{data}$ is simply the mean of $y\log\hat{y} + (1 - y)\log(1 - \hat{y})$ over the entire training set.
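To make that interpretation concrete, here is a minimal NumPy sketch of the cost as a mean over the training examples. The function name and the toy values are my own for illustration; the small clipping constant is a numerical guard against $\log(0)$, not part of the mathematical definition:

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """Empirical estimate of J(theta): the negative mean of the
    per-example log-likelihood over the training set.

    y     : array of labels in {0, 1}
    y_hat : array of model outputs f(x; theta), assumed in (0, 1)
    """
    eps = 1e-12  # numerical guard against log(0); not part of the math
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# hypothetical toy training set
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
loss = binary_cross_entropy(y, y_hat)
```

Because the labels are in $\{0,1\}$, each summand reduces to $\log\hat{y}$ when $y=1$ and $\log(1-\hat{y})$ when $y=0$, so the mean is exactly the empirical expectation in the formula above.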
My question, then, is whether the above formulation of the cost function is correct.