Neural Networks: checking a proof that the cost function $C=\frac{1}{2}||y-a(w)||^2$ is differentiable everywhere.

In his book Neural Networks and Deep Learning, Michael Nielsen introduces the cost function $$C_x:=\frac{1}{2}||y(x)-a||^2$$

as the cost of a single training example $x$, where $y(x)$ is the desired output while $a$ is the actual output, and thus a function of the weights, biases, and $x$. Each component of the vector $a$ corresponds to the output of one of the neurons in the last layer of the network, and each component of $y$ corresponds to the output we desire that neuron to have.
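For concreteness, the cost of a single training example can be computed directly from this definition (a minimal NumPy sketch; the particular values of `y` and `a` are hypothetical placeholders):

```python
import numpy as np

# Desired output y(x) and actual last-layer output a for one training example
# (illustrative values only).
y = np.array([1.0, 0.0, 0.0])
a = np.array([0.8, 0.1, 0.2])

# C_x = (1/2) * ||y - a||^2, summing the squared error over the output neurons.
C_x = 0.5 * np.sum((y - a) ** 2)
```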

Nielsen goes on to apply multivariable calculus to the cost function $C_x$; in particular, he seems to assume that the total differential of the function with respect to the weights $w$ and biases $b$ always exists, yet he doesn't seem to prove this anywhere. So that is what I set out to prove, and I would like to know if my proof is correct.


Lemma: let $w$ be a weight in the network and $a$ the output of a particular neuron $N$. Then ${\partial a}/{\partial w}$ exists and is continuous everywhere.

Proof:

We use induction on the position of $N$'s layer.

$\ \ \ $ Base case: suppose $N$ is in the second layer (the layer just after the input layer). If the weight $w$ corresponds to a neuron in a different layer, or to a neuron in the same layer as $N$ but other than $N$, then $\partial a /\partial w = 0$ and the result follows trivially. Otherwise the weight $w$ corresponds to $N$ and we have

$$a(w)=\sigma (wx+p)$$

for some constant $p$, input $x$, and sigmoid function $\sigma (z)= 1/(1+e^{-z})$. Thus $a'(w)$ exists and equals

$$x\sigma '(wx+p)$$

which is continuous with respect to $w$.
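The base-case derivative $a'(w) = x\,\sigma'(wx+p)$ can be sanity-checked against a central finite difference (the values of $x$, $p$, and $w$ below are arbitrary choices for illustration):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)  # standard identity for the logistic sigmoid

x, p, w = 0.7, -0.3, 1.5              # arbitrary input, constant, and weight
analytic = x * sigma_prime(w * x + p)  # a'(w) = x * sigma'(w*x + p)

h = 1e-6                               # central finite difference in w
numeric = (sigma((w + h) * x + p) - sigma((w - h) * x + p)) / (2 * h)
```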

$\ \ \ $ Inductive step: suppose now that $N$ is not in the second layer. We can write $a(w)$ as

$$\sigma \Big( \sum_{j}w_jx_j + b\Big)$$

for the weights $w_j$ and activations $x_j$ of the previous layer, and bias $b$. By the induction hypothesis each $\partial x_j/\partial w$ exists and is continuous with respect to $w$. Computing $a'(w)$ we get

$$\Big( \sum_j w_jx_j'(w)\Big)\sigma '\Big(\sum_j w_jx_j(w)+b\Big)$$

The first sum is continuous everywhere since each of its summands is (by induction). The second sum is continuous everywhere since each $x_j$ is (by the existence of its derivative). Finally, since $\sigma'$ is continuous, we have that $a'(w)$ is continuous everywhere.
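The inductive-step formula can also be checked numerically. The sketch below uses a hypothetical two-layer chain with a single hidden neuron, for the case where $w$ feeds an earlier layer (so the $w_j$ of $N$'s own layer are constant with respect to $w$ and the formula above applies as written):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

# Tiny chain: input x0 -> hidden neuron (weight w, bias p)
# -> output neuron N (single weight w1 over the hidden activation, bias b).
x0, p, b = 0.4, 0.1, -0.2
w1 = 0.9                               # the only w_j here; constant w.r.t. w

def hidden(w):                         # x_1(w): previous-layer activation
    return sigma(w * x0 + p)

def output(w):                         # a(w) = sigma(sum_j w_j x_j(w) + b)
    return sigma(w1 * hidden(w) + b)

w = 0.6
x1_prime = x0 * sigma_prime(w * x0 + p)                  # dx_1/dw (base case)
analytic = (w1 * x1_prime) * sigma_prime(w1 * hidden(w) + b)

h = 1e-6                               # central finite difference in w
numeric = (output(w + h) - output(w - h)) / (2 * h)
```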

Theorem: the cost function $C_x$ is differentiable everywhere, in the sense that its total differential exists everywhere.

Proof: we'll show that each partial derivative $\partial C_x/\partial w$ exists and is continuous everywhere; the result then follows, since a function whose partial derivatives all exist and are continuous is differentiable. We have

$$C_x=\frac{1}{2}||y-a||^2=\frac{1}{2}\sum_k (y_k-a_k)^2$$

thus we have

$$\frac{\partial C_x}{\partial w}=\sum_k\frac{1}{2}2\big(y_k-a_k(w)\big)\big(-a_k'(w)\big)=\sum_k\big( a_k(w)-y_k\big)a_k'(w)$$

and by the lemma each summand is continuous everywhere, and thus so is the entire expression. The theorem follows.
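This gradient formula can be verified numerically in the simplest setting, a single output neuron in the base-case form $a(w) = \sigma(wx+p)$, so that $\partial C_x/\partial w = (a - y)\,a'(w)$ reduces to one summand (all numeric values below are arbitrary illustrations):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

# One output neuron in the base-case setup: a(w) = sigma(w*x + p).
x, p, y, w = 0.5, 0.2, 1.0, -0.4

a = sigma(w * x + p)
a_prime = x * sigma_prime(w * x + p)
grad_analytic = (a - y) * a_prime      # (a_k - y_k) * a_k'(w), single term

# Central finite difference of C_x(w) = (1/2)(y - a(w))^2 for comparison.
h = 1e-6
C = lambda w_: 0.5 * (y - sigma(w_ * x + p)) ** 2
grad_numeric = (C(w + h) - C(w - h)) / (2 * h)
```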


Is the proof correct?