Understanding Back Propagation for ANNs


I have been learning backpropagation for NNs from this website, and I have a question about the derivation of a few values. For these questions I am assuming that the math and theory on the website are correct. If this is not the case, please let me know.

My first question concerns what form some of the variables take. As far as I can tell, the activation function $f(x)$ is something like $\frac{1}{1+e^{-x}}$, which implicitly operates componentwise on the input vectors, and $f'(x)$ works the same way. $x_n$ and $y_n$ are vectors with as many components as layer $n$ has nodes. $w_n$ is an $i$ by $j$ matrix if layer $n$ has $i$ nodes and layer $n+1$ has $j$ nodes. When multiplied, $x$ and $y$ are treated as row vectors. My question then concerns the definition of $\delta$. My intuitive understanding of $\delta_n$ is that it is a vector indicating how the values of the nodes in layer $n$ need to change, before the activation function is applied. As a result, each $\delta_n$ should be a vector. This is, as far as I can tell, supported by $\delta_N=f'(x_N)\left(|y_N-t|\right)$, since $f'(x_N)$ is a vector and the rest is a scalar. However, I do not understand why $\delta_n=f'(x_n)w_n\delta_{n+1}$ is a vector. Are my statements above correct, and why does this last expression give a vector?

My second question is similar to my first, but concerns $\frac{\partial c}{\partial w_n} = \delta_{n+1}y_n$. This value should be a matrix, at least as far as I understand. However, it intuitively does not seem like one and looks more like a dot product. This could be a matrix, if the first vector were interpreted as a column vector instead of a row vector. But if this is the case, what is the mathematical justification for this? And if not, then why is the value a matrix?

BEST ANSWER

First, we can write everything in index notation to obtain a more explicit form. Consider a neural net with $N$ layers and $M^n$ perceptrons in layer $n$.

$x^n_i$ and $y^n_i$ are the input and output of perceptron $i$ in layer $n$.

$w^n_{ji}$ is the weight between perceptron $i$ in layer $n$ and perceptron $j$ in layer $n+1$.

$f$ is the activation function.

Fix $y^n_0=1$ as the bias term for each layer.

$$y^n_i=f(x^n_i)\tag 1$$

$$x^{n+1}_j=\sum_{i=0}^{M^n}w^n_{ji}y^n_i\tag 2$$

Now let $t_i$ be a training output. We can define a quadratic cost function $c$.

$$c=\sum_{i=1}^{M^N}\frac 12(y^N_i-t_i)^2\tag 3$$
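As a concrete sketch of $(1)$–$(3)$, here is a minimal NumPy version of the forward pass and cost. The sigmoid activation and all function names are my own choices for illustration, not from the article:

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))  # example activation (sigmoid), componentwise

def forward(ws, x1):
    """Forward pass per (1)-(2). ws[n][j, i] stores w^n_{ji}; a bias
    entry y^n_0 = 1 is prepended to each layer's output."""
    xs, ys = [x1], []
    for w in ws:
        ys.append(np.concatenate(([1.0], f(xs[-1]))))  # (1) plus the bias term
        xs.append(w @ ys[-1])                          # (2)
    return xs, ys, f(xs[-1])                           # last layer's output y^N

def cost(yN, t):
    return 0.5 * np.sum((yN - t) ** 2)                 # (3): quadratic cost
```

Each weight matrix here has one extra column for the bias weight $w^n_{j0}$, matching the sum in $(2)$ starting at $i=0$.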

Now we can define the input gradient of the cost function $\delta$.

$$\delta^n_i=\frac{\partial c}{\partial x^n_i}$$

We can evaluate this for the last layer using the chain rule.

$$\delta^N_i=(y^N_i-t_i)f'(x^N_i)\tag 4$$

Applying the chain rule again, we obtain a recurrence relation for $\delta$.

$$\frac{\partial c}{\partial x^n_i}=\sum_{j=1}^{M^{n+1}}\frac{\partial c}{\partial x^{n+1}_j}\frac{\partial x^{n+1}_j}{\partial y^n_i}\frac{\partial y^n_i}{\partial x^n_i}$$

$$\implies\delta^n_i=\sum_{j=1}^{M^{n+1}}\delta^{n+1}_jw^n_{ji}f'(x^n_i)\tag 5$$

Similarly, we can compute the derivatives with respect to the weights.

$$\frac{\partial c}{\partial w^n_{ji}}=\frac{\partial c}{\partial x^{n+1}_j}\frac{\partial x^{n+1}_j}{\partial w^n_{ji}}$$

$$\implies\frac{\partial c}{\partial w^n_{ji}}=\delta^{n+1}_jy^n_i\tag 6$$
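Equations $(4)$–$(6)$ give a complete backward pass. A minimal NumPy sketch (sigmoid activation and all names assumed by me; the forward pass is repeated so the snippet is self-contained), with the result spot-checked against a finite difference:

```python
import numpy as np

f  = lambda x: 1.0 / (1.0 + np.exp(-x))  # example activation (sigmoid)
df = lambda x: f(x) * (1.0 - f(x))       # its derivative f'

def forward(ws, x1):
    """Forward pass per (1)-(2); ws[n][j, i] stores w^n_{ji},
    with a bias entry y^n_0 = 1 prepended to each layer's output."""
    xs, ys = [x1], []
    for w in ws:
        ys.append(np.concatenate(([1.0], f(xs[-1]))))
        xs.append(w @ ys[-1])
    return xs, ys, f(xs[-1])

def backward(ws, xs, ys, yN, t):
    """Gradients dc/dw^n_{ji} via (4)-(6), one matrix per layer."""
    delta = (yN - t) * df(xs[-1])             # (4)
    grads = []
    for w, x, y in zip(ws[::-1], xs[-2::-1], ys[::-1]):
        grads.append(np.outer(delta, y))      # (6): delta^{n+1}_j y^n_i
        delta = (delta @ w)[1:] * df(x)       # (5); [1:] drops the bias column
    return grads[::-1]
```

The `[1:]` in the recurrence drops the bias column: $y^n_0$ is fixed at $1$, so there is no corresponding $x^n_0$ and no $\delta^n_0$.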

These are all of the important equations in the article you provided. However, its notation convention is rather unusual. There seems to be an implied convention for row/column vectors, but I'm not sure what it would be (as a disclaimer, there are many competing schemes of notation for matrix/tensor calculus, and I have only been exposed to a few of them). Instead of interpreting the article precisely as written, I'll rewrite the equations above using the Jacobian formulation of matrix calculus. I'll use capital letters to indicate matrices, arrows $(\vec v)$ to indicate column vectors, and transposes $(\vec v^T)$ to indicate row vectors. Matrix multiplication works as normally defined; in particular, the inner product $\vec u^T\vec v$ produces a scalar and the outer product $\vec u\vec v^T$ produces a matrix.

Let $\vec x_n$ and $\vec y_n$ be the input and output column vectors and $\vec\delta_n^T$ be the input gradient, which by convention is a row vector. Let $W_n$ be the matrix of weights between layers $n$ and $n+1$. Also, let $f$ act componentwise on vectors ($f$ is essentially a nonlinear transform between vector spaces).

$$f(\vec x_n)=\vec y_n \tag 1$$

$$\vec x_{n+1}=W_n\vec y_n\tag 2$$

$$c=\frac 12(\vec y_N-\vec t)^T(\vec y_N-\vec t)\tag 3$$

For $(4)$, we'll need to define $J_f(\vec x_n)$ as the Jacobian of $f$. It is a diagonal matrix with entries $f'(x^n_i)$ on the diagonal.

$$\vec\delta_N^T=(\vec y_N-\vec t)^TJ_f(\vec x_N)\tag 4$$

$$\vec\delta_n^T=\vec\delta_{n+1}^TW_nJ_f(\vec x_n)\tag 5$$

In the last equation, the derivative with respect to a matrix is conventionally defined so that the entry at position $i, j$ contains the partial derivative with respect to entry $j, i$ of the matrix. The right-hand side is then evidently an outer product of two vectors.

$$\frac{\partial c}{\partial W_n}=\vec y_n\vec\delta_{n+1}^T\tag 6$$
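Transcribed into this vector/Jacobian notation, $(4)$–$(6)$ become one-liners. A single-layer NumPy sketch (no bias term, to keep the transcription literal; sigmoid and names are my own assumptions), which also checks the outer-product gradient against a finite difference:

```python
import numpy as np

f  = lambda x: 1.0 / (1.0 + np.exp(-x))  # sigmoid, acting componentwise
df = lambda x: f(x) * (1.0 - f(x))

# One weight layer: layer n has 2 nodes, layer n+1 has 3.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
x_n = rng.normal(size=2)
t = rng.normal(size=3)

y_n = f(x_n)                               # (1)
x_N = W @ y_n                              # (2)
y_N = f(x_N)

J = np.diag(df(x_N))                       # Jacobian of f at x_N: diagonal
delta_N = (y_N - t) @ J                    # (4), a row vector
delta_n = delta_N @ W @ np.diag(df(x_n))   # (5)
grad_W = np.outer(y_n, delta_N)            # (6): entry (i, j) is dc/dw_{ji}
```

Note that `grad_W` has shape $(2, 3)$ while `W` has shape $(3, 2)$, reflecting the transposed indexing convention described above.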