I would like to calculate the Jacobian and Hessian matrices of a feed-forward neural network output with respect to a given input vector, $I$:
$$A=W_n \times tansig(W_{n-1} \times ... \times tansig(W_1 \times I + B_1)+ ... +B_{n-1})+B_n$$ where
- $I$ is the input vector
- $W_i$ is the weight matrix of layer $i$
- $B_i$ is the bias vector of layer $i$
- $tansig$ is the activation function: $tansig(x) = \frac{2}{1 + e^{-2x}}-1$
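To make the setup concrete, here is a minimal numpy sketch of the forward pass (the layer sizes, weights, and input below are arbitrary illustrative choices; note that $tansig$ is numerically identical to $\tanh$):

```python
import numpy as np

def tansig(x):
    # tansig(x) = 2 / (1 + exp(-2x)) - 1, which equals tanh(x)
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def forward(Ws, Bs, I):
    """Forward pass: tansig on every layer except the last (linear) one."""
    a = I
    for W, B in zip(Ws[:-1], Bs[:-1]):
        a = tansig(W @ a + B)
    return Ws[-1] @ a + Bs[-1]

# Example: 3 inputs -> 5 hidden units -> 2 outputs (arbitrary sizes)
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((5, 3)), rng.standard_normal((2, 5))]
Bs = [rng.standard_normal(5), rng.standard_normal(2)]
I = rng.standard_normal(3)
A = forward(Ws, Bs, I)
```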
By applying the chain rule, we calculate the Jacobian matrix as follows:
Let $f_1 = tansig(W_1 \times I + B_1)$
$f_2 = tansig(W_2 \times f_1 + B_2)$
$...$
$f_{n-1} = tansig(W_{n-1} \times f_{n-2} + B_{n-1})$
$$ \to A = W_n \times f_{n-1}(f_{n-2} ... (f_1)...)+B_n$$ $$ \to Jacobian(A) = W_n \times \frac{\partial f_{n-1}}{\partial f_{n-2}} \times \frac{\partial f_{n-2}}{\partial f_{n-3}} \times ... \times \frac{\partial f_{1}}{\partial I}$$ The derivative of $f_i$ with respect to $f_{i-1}$ is: $$ \frac{\partial f_i}{\partial f_{i-1}} = diag\bigl(dtansig(W_i \times f_{i-1} + B_i)\bigr) \times W_i$$ where $dtansig$ is the first derivative of the activation $tansig$: $$dtansig(x) = \frac{4e^{-2x}}{(1 + e^{-2x})^2} = 1 - tansig(x)^2$$
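As a sanity check, the derivative of the standard $tansig$ (i.e. $\tanh$) satisfies $dtansig(x) = 1 - tansig(x)^2$, which can be verified against a central finite difference (the sample points and step size below are arbitrary choices):

```python
import numpy as np

def tansig(x):
    # standard tansig, identical to tanh
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def dtansig(x):
    # 4 e^{-2x} / (1 + e^{-2x})^2  ==  1 - tansig(x)^2
    return 1.0 - tansig(x) ** 2

x = np.linspace(-3.0, 3.0, 61)
h = 1e-6
# central finite difference of tansig
fd = (tansig(x + h) - tansig(x - h)) / (2.0 * h)
```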
Substituting the derivative of each $f_i$ into the Jacobian matrix, we have:
$$ \to Jacobian(A) = W_n \times diag\bigl(dtansig(W_{n-1} \times f_{n-2} + B_{n-1})\bigr) \times W_{n-1} \times ...\times diag\bigl(dtansig(W_1 \times I + B_1)\bigr) \times W_1$$
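A numerical sketch of this Jacobian product, checked column by column against finite differences of the forward pass (the network sizes and weights are arbitrary illustrative choices):

```python
import numpy as np

def tansig(x):
    return np.tanh(x)  # tansig(x) = 2/(1+e^{-2x}) - 1 = tanh(x)

def dtansig(x):
    return 1.0 - np.tanh(x) ** 2

def forward(Ws, Bs, I):
    a = I
    for W, B in zip(Ws[:-1], Bs[:-1]):
        a = tansig(W @ a + B)
    return Ws[-1] @ a + Bs[-1]

def jacobian(Ws, Bs, I):
    # J = W_n diag(dtansig(z_{n-1})) W_{n-1} ... diag(dtansig(z_1)) W_1
    a = I
    J = np.eye(len(I))
    for W, B in zip(Ws[:-1], Bs[:-1]):
        z = W @ a + B
        J = np.diag(dtansig(z)) @ W @ J
        a = tansig(z)
    return Ws[-1] @ J

rng = np.random.default_rng(1)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((4, 4)),
      rng.standard_normal((2, 4))]
Bs = [rng.standard_normal(4), rng.standard_normal(4), rng.standard_normal(2)]
I = rng.standard_normal(3)

J = jacobian(Ws, Bs, I)

# Finite-difference check, one input coordinate at a time
h = 1e-6
J_fd = np.column_stack([
    (forward(Ws, Bs, I + h * e) - forward(Ws, Bs, I - h * e)) / (2.0 * h)
    for e in np.eye(3)])
```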
Now, I am having a very hard time deriving $Hessian(A)$. With your knowledge and expertise, could you please help me find the Hessian matrix of the given neural network output, $A$?
Thank you very much!
Disclaimer: I am giving it a try, but I may have made some mistakes.
First of all,
$\frac{d\,tansig(x)}{dx} = 1 - T^2$, where $T = tansig(x)$.
This is because $tansig(x) = \frac{2e^{2x}-1-e^{2x}}{1+e^{2x}} = \frac{e^{2x}-1}{1+e^{2x}}$,
and so $\frac{d\,tansig(x)}{dx} = \frac{4e^{2x}}{(1+e^{2x})^2} = (1-T)(1+T) = 1 - T^2$.
So, W' = $\frac{dloss}{dW}$ = (dout*$(1-T^2)$).dot(X.T),
where,
'dout' is the gradient flowing backwards (I use numpy notation here: '*' means elementwise multiplication, A.dot(B) means matrix multiplication, and X.T is the transpose of X),
and T = tansig(WX+b)
from this we can get,
$\frac{d(W')}{dW}$ = ((dout*$(-2T)(1-T^2)$).dot(X.T)).dot(X.T); since T is tansig(WX+b), we pick up another (.).dot(X.T) here.
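A minimal numpy sketch of the gradient expression above for a single layer, using the standard tansig/tanh derivative $1-T^2$ and checked against a finite difference (the layer sizes and data are hypothetical, and the loss is a toy sum of outputs, so `dout` is all ones):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 3))
b = rng.standard_normal((4, 1))
X = rng.standard_normal((3, 5))   # 5 samples, hypothetical data

def loss(W):
    # toy scalar loss: sum of the layer's outputs
    return np.tanh(W @ X + b).sum()

T = np.tanh(W @ X + b)
dout = np.ones_like(T)                 # dloss/dT for the sum-loss
dW = (dout * (1.0 - T ** 2)).dot(X.T)  # W' = (dout * (1 - T^2)).dot(X.T)

# finite-difference check of one entry of W'
h = 1e-6
E = np.zeros_like(W)
E[0, 0] = h
fd = (loss(W + E) - loss(W - E)) / (2.0 * h)
```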
Hope it helps.