Derivative of sigmoid function that contains vectors

Could someone show me how to take the derivative of this function with respect to $w_i$?

$f(w) = \frac{1}{1+e^{-w^Tx}}$

$w$ and $x$ are both vectors $\in \mathbb{R}^D$

How would this be different from taking the derivative with respect to $w$ itself?

There are 5 solutions below.

Best answer:
You have $$w^Tx=\sum_{i=1}^D w_ix_i$$ For the derivative with respect to $w_i$ you can write the function as $$\frac 1{1+e^{-\sum_{j=1}^D w_jx_j}}=\frac 1{1+e^{-\sum_{j=1,j\ne i}^D w_jx_j}e^{-w_ix_i}}$$ The term with the sum does not contain $w_i$, so you can consider it a constant when you take the derivative.
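This reduction is easy to sanity-check numerically: hold every $w_j$ with $j \ne i$ fixed and take a finite difference in $w_i$ alone. A minimal Python sketch (the vectors and the helper name `f` are made up for illustration):

```python
import math

def f(w, x):
    # logistic function of the inner product w^T x
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

w = [0.5, -1.0, 2.0]
x = [1.0, 3.0, -0.5]
i, h = 1, 1e-6

# central finite difference in the single coordinate w_i,
# all other coordinates held constant
w_plus  = w[:]; w_plus[i]  += h
w_minus = w[:]; w_minus[i] -= h
numeric = (f(w_plus, x) - f(w_minus, x)) / (2 * h)

# closed form derived in the other answers: f(1 - f) * x_i
fw = f(w, x)
analytic = fw * (1 - fw) * x[i]
```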

Answer:

Define the scalar variable and its differential $$\eqalign{ \alpha &= w^Tx = x^Tw \cr d\alpha &= x^Tdw }$$ The derivative of the logistic function for a scalar variable is simple. $$\eqalign{ f &= \frac{1}{1+e^{-\alpha}} \cr f' &= f-f^2 \cr }$$ Use this to write the differential, perform a change of variables, and extract the gradient vector. $$\eqalign{ df &= \big(f-f^2\big)\,d\alpha \cr &= \big(f-f^2\big)\,x^Tdw \cr &= g^Tdw \cr \frac{\partial f}{\partial w} &= g = \big(f-f^2\big)\,x \cr }$$
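The gradient $g = (f - f^2)\,x$ extracted above can be sketched in code and checked component by component against a central finite difference (illustrative values; the function names are mine):

```python
import math

def f(w, x):
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w, x))))

def grad(w, x):
    # g = (f - f^2) x, the gradient read off from the differential df = g^T dw
    fw = f(w, x)
    return [(fw - fw * fw) * xi for xi in x]

w = [0.2, -0.7, 1.1, 0.0]
x = [1.5, 0.3, -2.0, 0.8]
g = grad(w, x)

# verify every component against a central finite difference
h = 1e-6
for i in range(len(w)):
    wp = w[:]; wp[i] += h
    wm = w[:]; wm[i] -= h
    fd = (f(wp, x) - f(wm, x)) / (2 * h)
    assert abs(fd - g[i]) < 1e-6
```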

Answer:

$$f(\boldsymbol{w}) = \dfrac{1}{1+\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right]}$$

$$\implies \dfrac{\partial f}{\partial w_i} = \dfrac{0\cdot(1+\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right])-1\cdot\dfrac{\partial}{\partial w_i}(1+\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right])}{(1+\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right])^2}$$ $$=-\dfrac{\dfrac{\partial}{\partial w_i}(1+\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right])}{(1+\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right])^2}$$ $$=-\dfrac{\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right]\dfrac{\partial}{\partial w_i}(-\boldsymbol{w}^T\boldsymbol{x})}{(1+\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right])^2} $$ $$=\dfrac{\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right]x_i}{(1+\exp\left[-\boldsymbol{w}^T\boldsymbol{x}\right])^2}$$ $$=f(\boldsymbol{w})\left[1- f(\boldsymbol{w})\right]x_i$$

If you take the derivative with respect to $\boldsymbol{w}$ you will simply get a stacked vector of these components.
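The intermediate form $\dfrac{\exp[-\boldsymbol{w}^T\boldsymbol{x}]}{(1+\exp[-\boldsymbol{w}^T\boldsymbol{x}])^2}$ and the final factored form $f(1-f)$ are the same scalar function of $a = \boldsymbol{w}^T\boldsymbol{x}$; a quick numerical check of that algebraic step (test points chosen arbitrarily):

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

# the quotient-rule form and the factored form agree for any a = w^T x
for a in [-4.0, -0.5, 0.0, 1.3, 6.0]:
    quotient_form = math.exp(-a) / (1.0 + math.exp(-a)) ** 2
    factored_form = sigma(a) * (1.0 - sigma(a))
    assert abs(quotient_form - factored_form) < 1e-12
```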

Answer:

We have $f(w) = \sigma(x^T w)$ (remember that $w^Tx = x^T w$). Hence the gradient vector with respect to $w$ is $$ \begin{align*} \frac{\partial}{\partial w}f(w) &= \sigma'(x^T w)\frac{\partial}{\partial w}(x^T w)\\ &= \sigma(x^T w)(1-\sigma(x^T w))x\\ &= \sigma(w^T x)(1-\sigma(w^T x))x. \end{align*} $$

(The first equality was from the multivariate chain rule, and the second from the fact that $\sigma'(z)= \sigma(z)(1-\sigma(z))$ and $\frac{\partial}{\partial w}(x^T w) = x$.)

Now that we know the gradient vector of $f(w)$, the derivative of $f(w)$ with respect to $w_i$ is the $i$-th component of the gradient vector. The $i$-th component of $\sigma(w^T x)(1-\sigma(w^T x))x$ is $\sigma(w^T x)(1-\sigma(w^T x))x_i$. Thus $$\boxed{\frac{\partial}{\partial w_i}f(w) = \sigma(w^T x)(1-\sigma(w^T x))x_i}.$$

(Of course, you could also get this result by just differentiating with respect to $w_i$ from the start. The steps would all be the same except that instead of calculating $\frac{\partial}{\partial w}(x^T w)$ in one of the steps above, we would calculate $\frac{\partial}{\partial w_i}(x^T w)$, which is $x_i$.)
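The identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$ used in the first equality can itself be spot-checked with a finite difference (a small sketch with arbitrary test points):

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

h = 1e-6
for z in [-3.0, -0.2, 0.0, 0.9, 4.0]:
    numeric = (sigma(z + h) - sigma(z - h)) / (2 * h)  # finite-difference sigma'(z)
    identity = sigma(z) * (1.0 - sigma(z))             # the closed-form identity
    assert abs(numeric - identity) < 1e-6
```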

Answer:

With

$x = (x_1, x_2, \ldots, x_n)^T \tag 1$

and

$w = (w_1, w_2, \ldots, w_n)^T, \tag 2$

we have

$w^Tx = \displaystyle \sum_{i=1}^n w_i x_i; \tag 3$

we observe that

$\dfrac{\partial (w^Tx)}{\partial w_j} = x_j, \; 1 \le j \le n; \tag 4$

we may write

$f(w) = \dfrac{1}{1 + e^{-w^Tx}} = (1 + e^{-w^Tx})^{-1}, \tag 5$

and deploy the chain rule:

$\dfrac{\partial f(w)}{\partial w_j} = \dfrac{df(w)}{d(w^Tx)} \dfrac{\partial (w^Tx)}{\partial w_j}$ $= -(1 + e^{-w^Tx})^{-2} (-e^{-w^Tx}) \dfrac{\partial (w^Tx)}{\partial w_j} = (1 + e^{-w^Tx})^{-2} (e^{-w^Tx}) x_j = \dfrac{e^{-w^Tx} x_j}{(1 + e^{-w^Tx})^{2}}. \tag 6$

The above shows how to form the derivatives with respect to the individual $w_i$; to take the derivative with respect to $w$, we again call on the chain rule, but rather than (6) we have

$\dfrac{\partial f(w)}{\partial w} = \dfrac{df(w)}{d(w^Tx)} \dfrac{\partial (w^Tx)}{\partial w} = (1 + e^{-w^Tx})^{-2} (e^{-w^Tx})\dfrac{\partial (w^Tx)}{\partial w}, \tag 7$

where it remains to evaluate

$\dfrac{\partial (w^Tx)}{\partial w}; \tag 8$

but this is straightforward; we form the difference

$(w + \Delta w)^T x - w^Tx = w^Tx + \Delta w^T x - w^Tx = \Delta w^T x, \tag 9$

whence

$(w + \Delta w)^T x - w^Tx - \Delta w^T x = 0, \tag{10}$

which yields

$\Vert (w + \Delta w)^T x - w^Tx - \Delta w^T x \Vert = 0, \tag{11}$

independently of $\Vert \Delta w \Vert$; since it then follows that

$\dfrac{\Vert (w + \Delta w)^T x - w^Tx - \Delta w^T x \Vert}{\Vert \Delta w \Vert} = 0, \; \forall \Delta w \ne 0, \tag{12}$

we may conclude that the linear map

$\Delta w \to \Delta w^T x \tag{13}$

is the sought-for derivative (8); it follows then that (7) may be written

$\dfrac{\partial f(w)}{\partial w} = \dfrac{df(w)}{d(w^Tx)} \dfrac{\partial (w^Tx)}{\partial w} = (1 + e^{-w^Tx})^{-2} (e^{-w^Tx})(\cdot)^Tx; \tag{14}$

we note that this is the linear mapping from $\Bbb R^n \to \Bbb R$ given by

$\Delta w \mapsto \dfrac{ e^{-w^Tx}}{ (1 + e^{-w^Tx})^2} \Delta w^T x. \tag{15}$

We may cast this in a somewhat more familiar form via the observation that, since $\Delta w^T x \in \Bbb R$ is a scalar quantity,

$\Delta w^T x = (\Delta w^T x)^T = x^T (\Delta w^T)^T = x^T \Delta w, \tag{16}$

and therefore (15) becomes

$\Delta w \mapsto \dfrac{ e^{-w^Tx}}{ (1 + e^{-w^Tx})^2} x^T \Delta w, \tag{17}$

so we at last find that

$\dfrac{\partial f(w)}{\partial w} = \dfrac{df(w)}{d(w^Tx)} \dfrac{\partial (w^Tx)}{\partial w} = \dfrac{ e^{-w^Tx}}{ (1 + e^{-w^Tx})^2} x^T \tag{18}$

is the derivative we seek.
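The linear-map reading of (18) can also be checked numerically: for any direction $\Delta w$, the quotient $\big(f(w + t\,\Delta w) - f(w)\big)/t$ should approach $\dfrac{e^{-w^Tx}}{(1+e^{-w^Tx})^2}\,x^T\Delta w$ as $t \to 0$. A sketch with made-up numbers:

```python
import math

def f(w, x):
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w, x))))

w  = [0.4, -1.2, 0.9]
x  = [2.0, 0.5, -1.0]
dw = [0.3, -0.1, 0.7]   # an arbitrary direction Delta w

a = sum(wi * xi for wi, xi in zip(w, x))         # a = w^T x
coeff = math.exp(-a) / (1.0 + math.exp(-a)) ** 2  # df/d(w^T x)
linear_map = coeff * sum(xi * di for xi, di in zip(x, dw))  # (18) applied to dw

t = 1e-6
wt = [wi + t * di for wi, di in zip(w, dw)]
directional = (f(wt, x) - f(w, x)) / t  # forward difference along dw
```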