Is logistic regression cost function in SciKit Learn different from standard derivations?


I am trying to understand the math behind logistic regression. Going through a couple of websites, lectures and books, I tried to derive the cost function by thinking of it as the negative of the maximum likelihood. My derivation matches the cost function shown in this Wikipedia page https://en.wikipedia.org/wiki/Logistic_regression and in other places.

If the inputs are $x^{(i)}$ and outputs are $y^{(i)}$, where $(i)$ refers to the $i$th data point, then the cost as a function of weights $w$ seems to be

$$-\sum_{i=1}^m y^{(i)} \log\left(\frac{1}{1 + e^{-w^Tx^{(i)}}}\right)+\left(1-y^{(i)} \right) \log \left( 1-\frac{1}{1 + e^{-w^Tx^{(i)}}} \right) $$

I can simplify further to $$\sum_{i=1}^m -w^Tx^{(i)} y^{(i)} +\log (1+e^{w^Tx^{(i)}}),$$ using the identity $w^Tx^{(i)}+\log(1+e^{-w^Tx^{(i)}})=\log(1+e^{w^Tx^{(i)}})$. However, the expression shown in the scikit-learn user guide https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression is $$\sum_{i=1}^m \log ( 1 + e^{-w^Tx^{(i)}y^{(i)}}) $$

I have tried some algebra but am not able to derive their formulation. Am I missing something? It is quite possible that I haven't tried all the available simplification tricks.
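For reference, the BCE cost and its simplified form can be sanity-checked numerically. The data and weights below are arbitrary, made up purely for the check; note the simplified form uses a positive exponent, $\log(1+e^{w^Tx^{(i)}})$, which follows from $w^Tx^{(i)}+\log(1+e^{-w^Tx^{(i)}})=\log(1+e^{w^Tx^{(i)}})$:

```python
import numpy as np

# Arbitrary toy data and weights, purely for the numerical check.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))      # 8 data points, 3 features
y = rng.integers(0, 2, size=8)   # labels in {0, 1}
w = rng.normal(size=3)

z = X @ w                        # w^T x^{(i)} for every i
p = 1.0 / (1.0 + np.exp(-z))     # sigmoid

# Binary cross-entropy, summed over data points (as in the question)
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Simplified form: sum_i [ -w^T x^{(i)} y^{(i)} + log(1 + e^{w^T x^{(i)}}) ]
simplified = np.sum(-z * y + np.log(1.0 + np.exp(z)))

assert np.isclose(bce, simplified)
```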

1


Best Answer

From the same scikit-learn link, note that in their notation the target $y^{(i)}$ is assumed to take values in $\{-1,+1\}$.

In contrast, $y^{(i)} \in \{0,1\}$ in your initial definition of the binary cross-entropy (BCE) cost function.

In the scikit-learn notation, we have

$$P(y^{(i)}=+1\mid x^{(i)})=\sigma(w^Tx^{(i)})=\frac{1}{1+e^{-w^Tx^{(i)}}}$$

$$P(y^{(i)}=-1\mid x^{(i)})=1-\sigma(w^Tx^{(i)})=\frac{1}{1+e^{w^Tx^{(i)}}},$$

so that in both cases we have

$$P(y^{(i)}\mid x^{(i)})=\frac{1}{1+e^{-w^Tx^{(i)}y^{(i)}}}$$
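This unified expression is easy to verify numerically for both label values (the $z$ values below stand in for $w^Tx^{(i)}$ and are arbitrary):

```python
import numpy as np

z = np.linspace(-3.0, 3.0, 7)            # arbitrary values of w^T x
sigma = 1.0 / (1.0 + np.exp(-z))         # sigmoid

# y = +1: 1/(1 + e^{-z*(+1)}) equals sigma(z)
assert np.allclose(1.0 / (1.0 + np.exp(-z * 1)), sigma)
# y = -1: 1/(1 + e^{-z*(-1)}) equals 1 - sigma(z)
assert np.allclose(1.0 / (1.0 + np.exp(-z * -1)), 1.0 - sigma)
```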

Under the independence assumption, the likelihood is $\prod\limits_{i=1}^m \frac{1}{1+e^{-w^Tx^{(i)}y^{(i)}}}$.

The negative log-likelihood is $\sum\limits_{i=1}^m \log(1+e^{-w^T x^{(i)} y^{(i)}})$, which is exactly the cost function scikit-learn minimizes (up to an added regularization term, $L_2$ by default).
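Putting it together: mapping $y \in \{0,1\}$ to $\tilde y = 2y-1 \in \{-1,+1\}$ makes your BCE cost and scikit-learn's formulation agree. A minimal numerical sketch with made-up data and weights (not scikit-learn's actual implementation, which also adds the regularization term):

```python
import numpy as np

# Arbitrary toy data and weights, purely to demonstrate the equivalence.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
y01 = rng.integers(0, 2, size=10)   # labels in {0, 1}
w = rng.normal(size=4)

z = X @ w
p = 1.0 / (1.0 + np.exp(-z))

# BCE with y in {0, 1} (the question's formulation)
bce = -np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

# scikit-learn-style loss with y in {-1, +1}
y_pm = 2 * y01 - 1
sk_loss = np.sum(np.log(1.0 + np.exp(-z * y_pm)))

assert np.isclose(bce, sk_loss)
```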