I'm trying to derive formulas used in backpropagation for a neural network that uses a binary cross entropy loss function. When I perform the differentiation, however, my signs do not come out right:
Binary cross entropy loss function: $$J(\hat y) = \frac{-1}{m}\sum_{i=1}^m \big[y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big]$$
where
$m = $ number of training examples
$y = $ true y value
$\hat y = $ predicted y value
When I attempt to differentiate this for one training example, I do the following process:
Constant multiple rule: $$ \frac{dJ}{d\hat y_i} = -1\Big(\frac{d}{d\hat y_i}\big(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big)\Big) $$
Sum rule: $$ = -1\Big(\frac{d}{d\hat y_i}y_i\log(\hat y_i)+\frac{d}{d\hat y_i}(1-y_i)\log(1-\hat y_i)\Big) $$
Constant multiple rule (treating $y_i$ as a constant) and derivative of the natural log: $$ = -1\Big(\frac{y_i}{\hat y_i} + \frac{1-y_i}{1 - \hat y_i}\Big)$$
However, this is different from the expected result: $$ \frac{dJ}{d\hat y_i} = -1(\frac{y_i}{\hat y_i} - \frac{1-y_i}{1 - \hat y_i}) $$
I'm sure I'm doing something incorrectly, but I can't figure out what it is. Any help is appreciated!
Let's denote the inner/Frobenius product by $a:b= a^Tb$
and the elementwise/Hadamard product by $a\odot b$
and elementwise/Hadamard division by $\frac{a}{b}$
and note that the $\log$ function is to be applied elementwise.
For convenience, let's use a modified loss function $$L=-mJ$$ Then the differential and gradient of $L$ can be calculated as $$\eqalign{ L &= y:\log({\hat y}) + (1-y):\log(1-{\hat y}) \cr \cr dL &= y:d\log({\hat y}) + (1-y):d\log(1-{\hat y}) \cr &= \frac{y}{{\hat y}}:d{\hat y} + \frac{1-y}{1-{\hat y}}:d(1-{\hat y}) \cr &= \Big(\frac{y}{{\hat y}} - \frac{1-y}{1-{\hat y}}\Big):d{\hat y} \cr &= \Big(\frac{y-{\hat y}}{{\hat y}-{\hat y}\odot{\hat y}}\Big):d{\hat y} \cr \cr \frac{\partial L}{\partial{\hat y}} &= \frac{y-{\hat y}}{{\hat y}-{\hat y}\odot{\hat y}} \cr \cr }$$ Note the step from the second to the third line of $dL$: since $d(1-\hat y)=-d\hat y$, the second term picks up a minus sign. That inner derivative is the sign you dropped in your attempt. And the gradient of the original cost function is $$\eqalign{ \frac{\partial J}{\partial{\hat y}} &= -\frac{1}{m}\frac{\partial L}{\partial{\hat y}} = \frac{{\hat y}-y}{m\,({\hat y}-{\hat y}\odot{\hat y})} \cr }$$
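If you want to convince yourself of the sign, you can check the closed-form gradient numerically. Here's a quick NumPy sketch (variable names are my own) comparing $\frac{\hat y - y}{m(\hat y - \hat y\odot\hat y)}$ against a central finite difference of $J$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
y = rng.integers(0, 2, size=m).astype(float)  # true labels in {0, 1}
yhat = rng.uniform(0.1, 0.9, size=m)          # predictions, kept inside (0, 1)

def J(yhat):
    # binary cross entropy: -(1/m) * sum(y log(yhat) + (1-y) log(1-yhat))
    return -np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

# closed-form gradient from the derivation above
grad = (yhat - y) / (m * (yhat - yhat * yhat))

# central finite difference, one coordinate at a time
eps = 1e-6
num = np.zeros(m)
for i in range(m):
    e = np.zeros(m)
    e[i] = eps
    num[i] = (J(yhat + e) - J(yhat - e)) / (2 * eps)

print(np.allclose(grad, num, atol=1e-6))  # True
```

Flipping the sign of the second term in `grad` (i.e. using your $+$ version) makes the check fail, which pins the error on the missing chain-rule factor from $\log(1-\hat y_i)$.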