derivative of cost function for Logistic Regression

I am going over the lectures on Machine Learning at Coursera.

I am struggling with the following. How can the partial derivative of

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$

where $h_{\theta}(x)$ is defined as follows

$$h_{\theta}(x)=g(\theta^{T}x)$$ $$g(z)=\frac{1}{1+e^{-z}}$$

be $$ \frac{\partial}{\partial\theta_{j}}J(\theta) =\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{i})-y^i)x_j^i\,?$$

In other words, how would we go about calculating the partial derivative with respect to $\theta_j$ of the cost function (the logs are natural logarithms):

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$

There are 4 best solutions below

16
On BEST ANSWER

The reason is the following. We use the notation:

$$\theta x^i:=\theta_0+\theta_1 x^i_1+\dots+\theta_p x^i_p.$$

Then

$$\log h_\theta(x^i)=\log\frac{1}{1+e^{-\theta x^i} }=-\log ( 1+e^{-\theta x^i} ),$$ $$\log(1- h_\theta(x^i))=\log\left(1-\frac{1}{1+e^{-\theta x^i} }\right)=\log\frac{e^{-\theta x^i}}{1+e^{-\theta x^i}}=\log (e^{-\theta x^i} )-\log ( 1+e^{-\theta x^i} )=-\theta x^i-\log ( 1+e^{-\theta x^i} ).$$ Here we wrote $1 = \frac{1+e^{-\theta x^i}}{1+e^{-\theta x^i}}$, so that the $1$'s in the numerator cancel, and then used $\log(x/y) = \log(x) - \log(y)$.

Our original cost function has the form

$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{i}\log(h_\theta(x^{i}))+(1-y^{i})\log(1-h_\theta(x^{i}))\right]$$

Plugging in the two simplified expressions above, we obtain $$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[-y^i\log ( 1+e^{-\theta x^i}) + (1-y^i)(-\theta x^i-\log ( 1+e^{-\theta x^i} ))\right],$$ which, since the two $y^i\log(1+e^{-\theta x^i})$ terms cancel, can be simplified to $$J(\theta)=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\theta x^i-\log(1+e^{-\theta x^i})\right]=-\frac{1}{m}\sum_{i=1}^m \left[y^i\theta x^i-\log(1+e^{\theta x^i})\right],~~(*)$$

where the second equality follows from

$$-\theta x^i-\log(1+e^{-\theta x^i})= -\left[ \log e^{\theta x^i}+ \log(1+e^{-\theta x^i} ) \right]=-\log(1+e^{\theta x^i}). $$ Here we used $\log(x) + \log(y) = \log(xy)$.
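As a quick sanity check on the simplification, the following minimal sketch evaluates both the original cost and the simplified form $(*)$ on a small made-up data set. The data, the parameter values, and the function names `cost_original` and `cost_simplified` are assumptions chosen only for illustration; each sample starts with a constant $1$ so that $\theta_0$ acts as the intercept.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def cost_original(theta, X, y):
    """J(theta) written with log(h) and log(1 - h)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(X)

def cost_simplified(theta, X, y):
    """J(theta) in the form (*): y * (theta x) - log(1 + e^{theta x})."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(t * x for t, x in zip(theta, xi))
        total += yi * z - math.log(1.0 + math.exp(z))
    return -total / len(X)

# Hypothetical toy data: first entry of each sample is the constant 1.
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1], [1.0, -2.0, -0.4]]
y = [1, 0, 1, 0]
theta = [0.2, -0.5, 0.7]

# The two values should agree up to floating-point rounding.
print(cost_original(theta, X, y), cost_simplified(theta, X, y))
```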

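All you need now is to compute the partial derivatives of $(*)$ w.r.t. $\theta_j$. As $$\frac{\partial}{\partial \theta_j}y^i\theta x^i=y^ix^i_j, $$ $$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}}=x^i_jh_\theta(x^i),$$

substituting into $(*)$ gives $$\frac{\partial}{\partial\theta_j}J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^ix^i_j-x^i_jh_\theta(x^i)\right]=\frac{1}{m}\sum_{i=1}^m(h_\theta(x^i)-y^i)x^i_j$$ (with the convention $x^i_0=1$ for the intercept $\theta_0$), which is the claimed formula.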
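The resulting formula can also be checked numerically. The sketch below compares the closed-form gradient $\frac{1}{m}\sum_i (h_\theta(x^i)-y^i)x^i_j$ with a central finite-difference approximation of $J$. The toy data, the step size `eps`, and the helper names `gradient` and `numerical_gradient` are illustrative assumptions, not part of the original answer.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Logistic regression cost J(theta)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(X)

def gradient(theta, X, y):
    """Closed form: (1/m) * sum_i (h(x^i) - y^i) * x^i_j for each j."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(t * x for t, x in zip(theta, xi))) - yi
        for j, xij in enumerate(xi):
            grad[j] += err * xij
    return [g / len(X) for g in grad]

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences of J, one coordinate at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))
    return grad

# Hypothetical toy data: first entry of each sample is the constant 1.
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1], [1.0, -2.0, -0.4]]
y = [1, 0, 1, 0]
theta = [0.2, -0.5, 0.7]

for a, n in zip(gradient(theta, X, y), numerical_gradient(theta, X, y)):
    print(a, n)  # the two columns should agree to several decimal places
```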

0
On

@pedro-lopes, this is the chain rule: $$(u(v))' = u'(v) \cdot v'$$ For example, with $$y = \sin(3x - 5),$$ $$u(v) = \sin(v),$$ $$v = 3x - 5,$$ we get $$y' = (\sin(3x - 5))' = \cos(3x - 5) \cdot (3 - 0) = 3\cos(3x-5)$$

Regarding $$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i})=\frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}},$$ take $$u(v) = \log(v),$$ $$v = 1+e^{\theta x^i}.$$ Then $$\frac{\partial}{\partial \theta_j}\log(1+e^{\theta x^i}) = \frac{d}{dv}\log(v) \cdot \frac{\partial}{\partial \theta_j}(1+e^{\theta x^i}) = \frac{1}{1+e^{\theta x^i}} \cdot (0 + x^i_je^{\theta x^i}) = \frac{x^i_je^{\theta x^i}}{1+e^{\theta x^i}} $$ Note that $$(\log x)' = \frac{1}{x}$$ I hope that answers your question!
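For readers who prefer to see the chain rule step with numbers, here is a small sketch for the scalar case (a single parameter `theta` and a single feature `x`, both arbitrary values assumed for illustration). It multiplies the outer derivative $1/v$ by the inner derivative $x e^{\theta x}$ and compares the product with a finite-difference estimate.

```python
import math

def f(theta, x):
    """f(theta) = log(1 + e^{theta * x})."""
    return math.log(1.0 + math.exp(theta * x))

def df_chain_rule(theta, x):
    """Derivative of f w.r.t. theta via the chain rule."""
    v = 1.0 + math.exp(theta * x)         # inner function v(theta)
    du_dv = 1.0 / v                       # derivative of log(v) w.r.t. v
    dv_dtheta = x * math.exp(theta * x)   # derivative of v w.r.t. theta
    return du_dv * dv_dtheta

theta, x, eps = 0.3, 1.7, 1e-6            # arbitrary illustrative values
numeric = (f(theta + eps, x) - f(theta - eps, x)) / (2 * eps)

print(df_chain_rule(theta, x), numeric)
# The same value can be written as x * g(theta * x), with g the logistic function.
print(x / (1.0 + math.exp(-theta * x)))
```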

3
On

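We have \begin{align*} L(\theta) &= -\frac{1}{m}\sum\limits_{i=1}^{m}\left[y_i \log P(y_i|x_i,\theta) + (1-y_i) \log{(1 - P(y_i|x_i,\theta))}\right] \\ h_\theta(x_i) &= P(y_i|x_i,\theta) = P(y_i=1|x_i,\theta) = \frac{1}{1+\exp{\left(-\sum\limits_k \theta_k x_i^k \right)}} \end{align*} Here $P(y_i|x_i,\theta)$ is used as shorthand for $P(y_i=1|x_i,\theta)$, and $x_i^k$ denotes the $k$-th feature of sample $i$ (an index, not a power).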

Then \begin{align*} \log{(P(y_i|x_i,\theta))}=\log{(P(y_i=1|x_i,\theta))} &=-\log{\left(1+\exp{\left(-\sum\limits_k \theta_k x_i^k \right)} \right)} \\ \Rightarrow \frac{\partial }{\partial \theta_j} \log P(y_i|x_i,\theta) =\frac{x_i^j \exp{\left(-\sum\limits_k \theta_k x_i^k\right)}}{1+\exp{\left(-\sum\limits_k \theta_k x_i^k\right)}} &= x_i^j \left(1-P(y_i|x_i,\theta)\right) \end{align*} and \begin{align*} \log{(1-P(y_i|x_i,\theta))}=\log{(1-P(y_i=1|x_i,\theta))} &=-\sum\limits_k \theta_k x_i^k -\log{\left(1+\exp{\left(-\sum\limits_k \theta_k x_i^k \right)} \right)} \\ \Rightarrow \frac{\partial }{\partial \theta_j} \log{(1 - P(y_i|x_i,\theta))} &= -x_i^j + x_i^j \left(1-P(y_i|x_i,\theta)\right) = -x_i^j P(y_i|x_i,\theta) \end{align*}

Hence,

\begin{align*} \frac{\partial }{\partial \theta_j} L(\theta) &= -\frac{1}{m}\sum\limits_{i=1}^{m}\left[y_i \frac{\partial }{\partial \theta_j} \log P(y_i|x_i,\theta) + (1-y_i) \frac{\partial }{\partial \theta_j} \log{(1 - P(y_i|x_i,\theta))}\right] \\ &=-\frac{1}{m}\sum\limits_{i=1}^{m}\left[y_i x_i^j \left(1-P(y_i|x_i,\theta)\right) - (1-y_i) x_i^j P(y_i|x_i,\theta)\right] \\ &=-\frac{1}{m}\sum\limits_{i=1}^{m}\left[y_i x_i^j - x_i^j P(y_i|x_i,\theta)\right] \\ &=\frac{1}{m}\sum\limits_{i=1}^{m}{(P(y_i|x_i,\theta)-y_i) x_i^j} \end{align*} which is the claimed formula, since $h_\theta(x_i) = P(y_i|x_i,\theta)$.
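The two per-sample identities used in this derivation, $\frac{\partial}{\partial\theta_j}\log P = x^j(1-P)$ and $\frac{\partial}{\partial\theta_j}\log(1-P) = -x^jP$, can be spot-checked for a single sample. This is only a sketch: the values of `theta` and `x` and the helper names `p_one` and `partial` are assumptions made for the example.

```python
import math

def p_one(theta, x):
    """P(y = 1 | x, theta) for the logistic model."""
    s = sum(t * xk for t, xk in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-s))

def partial(f, theta, j, eps=1e-6):
    """Central-difference estimate of the partial derivative of f w.r.t. theta_j."""
    up = list(theta)
    down = list(theta)
    up[j] += eps
    down[j] -= eps
    return (f(up) - f(down)) / (2 * eps)

theta = [0.4, -1.1, 0.6]   # arbitrary illustrative parameters
x = [1.0, 0.7, -0.2]       # one sample, first entry is the constant 1
p = p_one(theta, x)

residuals = []
for j in range(len(theta)):
    d_log_p = partial(lambda t: math.log(p_one(t, x)), theta, j)
    d_log_q = partial(lambda t: math.log(1.0 - p_one(t, x)), theta, j)
    # Both differences should be close to zero if the identities hold.
    residuals.append((d_log_p - x[j] * (1.0 - p), d_log_q + x[j] * p))

print(residuals)
```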

0
On

$ \def\o{{\tt1}}\def\p{\partial}\def\J{{\cal J}} \def\LR#1{\left(#1\right)} \def\BR#1{\Bigl(#1\Bigr)} \def\diag#1{\operatorname{diag}\LR{#1}} \def\diagb#1{\operatorname{diag}\BR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\Diagb#1{\operatorname{Diag}\BR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\qif{\quad\iff\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} $For ease of typing, replace the Greek symbol $(\theta\to w)\,$ and collect all of the $x_k$ vectors into a matrix, i.e. $$\eqalign{ X = {\tt[}x_1\;x_2\ldots\,x_m {\tt]} \\ }$$ What you have called $g(z)$ is actually the logistic function which has a well-known derivative $$\frac{dg}{dz} = (1-g)\,g \qif dg = (1-g)\,g\;dz$$ When applied elementwise to the vector argument $(X^Tw),\,$ it produces a vector result $$\eqalign{ h &= g(X^Tw) \\ dh &= \LR{\o-h}\odot h\odot d(X^Tw) \\ &= \LR{\o-h}\odot h\odot (X^Tdw) \\ }$$ where $(\odot)$ denotes the elementwise/Hadamard product.

But a Hadamard product with a vector can be replaced by the standard product by using a diagonal matrix created from the vector. Therefore $$\eqalign{ H &= \Diag h &\qif h = \diag H = H\o \\ dh &= \LR{I-H}HX^Tdw &\qif \grad hw = \LR{I-H}HX^T \\ }$$ The cost function can now be expressed in a purely matrix form $$\eqalign{ Y &= \Diag y \\ \J &= -\fracLR 1m\BR{Y:\log(H)+(I-Y):\log(I-H)} \\ }$$ where $(:)$ denotes the Frobenius inner product $$A:B = \trace{A^TB} = \trace{AB^T}$$ Since diagonal matrices are almost as easy to work with as scalars, it becomes a rather straightforward if tedious exercise to calculate the gradient $$\eqalign{ d\J &= -\fracLR 1m\BR{Y:d\log(H)+(I-Y)\,:\,d\log(I-H)} \\ &= -\fracLR 1m\BR{Y:H^{-1}dH \;-\; (I-Y)\,:\,(I-H)^{-1}dH} \\ &= -\fracLR 1m\BR{H^{-1}Y \;-\; (I-H)^{-1}(I-Y)}\,:\,\Diag{dh} \\ &= -\fracLR 1m\diagb{H^{-1}Y \;-\; (I-H)^{-1}(I-Y)}\,:\,dh \\ &= -\fracLR 1m\BR{H^{-1}Y \;-\; (I-H)^{-1}(I-Y)}\,\o\,:\,\LR{I-H}HX^Tdw \\ &= -\fracLR 1mX\LR{I-H}H\BR{H^{-1}Y \;-\; (I-H)^{-1}(I-Y)}\,\o\,:\,dw \\ &= -\fracLR 1mX\BR{\LR{I-H}Y \;+\; H(Y-I)}\,\o\,:\,dw \\ &= -\fracLR 1mX\BR{Y-HY \;+\;HY - H}\,\o\,:\,dw \\ &= -\fracLR 1mX\BR{Y-H}\,\o\,:\,dw \\ &= +\fracLR 1mX\BR{h-y}\,:\,dw \\ \grad{\J}{w} &= \fracLR 1mX\BR{h-y} \\ }$$
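To connect the matrix result with the summation formula from the question, the sketch below computes $\frac{1}{m}X(h-y)$ with $X$ stored as a list of rows, one sample per column as in this answer, and compares it with the sample-by-sample sum. Plain Python lists are used so the example has no dependencies; the data and the function names `gradient_matrix_form` and `gradient_sum_form` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def gradient_matrix_form(w, X, y):
    """(1/m) X (h - y), where X is n-by-m with one sample per column."""
    n, m = len(X), len(X[0])
    # h = g(X^T w), applied elementwise
    h = [sigmoid(sum(X[k][i] * w[k] for k in range(n))) for i in range(m)]
    r = [h[i] - y[i] for i in range(m)]
    return [sum(X[j][i] * r[i] for i in range(m)) / m for j in range(n)]

def gradient_sum_form(w, X, y):
    """(1/m) sum_i (h(x_i) - y_i) x_i, accumulated one sample at a time."""
    n, m = len(X), len(X[0])
    grad = [0.0] * n
    for i in range(m):
        xi = [X[k][i] for k in range(n)]      # i-th column of X
        hi = sigmoid(sum(wk * xk for wk, xk in zip(w, xi)))
        for j in range(n):
            grad[j] += (hi - y[i]) * xi[j] / m
    return grad

# Hypothetical toy data: 3 features (first row is the constant 1), 4 samples.
X = [[1.0, 1.0, 1.0, 1.0],
     [0.5, -0.3, 1.5, -2.0],
     [-1.2, 0.8, 0.1, -0.4]]
y = [1, 0, 1, 0]
w = [0.2, -0.5, 0.7]

print(gradient_matrix_form(w, X, y))
print(gradient_sum_form(w, X, y))
```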