Chain rule for a function mapping $\mathbb{R}^k$ to $\mathbb{R}$?


I am working on backpropagation in machine learning. I cannot understand how exactly the chain rule applies in this specific case:

Loss = Activation(y_vector)

y_vector = F(h_vector)

When I take the gradient of Loss (a scalar) with respect to h_vector, how does the chain rule work? Since Loss is a scalar, Activation is a function from $\mathbb{R}^k$ to $\mathbb{R}$.

Here h_vector and y_vector have the same dimension.

Accepted answer:

Your loss function is the activation function $A$ of the variables $y_1,\ldots,y_k$.

Each of these variables is a function of the variables $h_1,\ldots,h_k$.

The gradient is $(\tfrac{\partial A}{\partial h_1},\ldots, \tfrac{\partial A}{\partial h_k})$. The chain rule gives you (for $i=1,\ldots, k$)

$$\frac{\partial A}{\partial h_i}= \frac{\partial A}{\partial y_1}\frac{\partial y_1}{\partial h_i} + \frac{\partial A}{\partial y_2}\frac{\partial y_2}{\partial h_i} + \cdots + \frac{\partial A}{\partial y_k}\frac{\partial y_k}{\partial h_i}.$$
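This sum can be checked numerically. Here is a minimal NumPy sketch with an arbitrarily chosen $A$ and $F$ (not taken from the question): $A(y) = \sum_j y_j^2$ and $y_j = \sin(h_j) + h_1$, picked so the Jacobian of $F$ is not diagonal. The chain-rule sum over $j$ is exactly a matrix product with the transposed Jacobian, and the result is verified against finite differences.

```python
import numpy as np

k = 3
h = np.array([0.5, -1.0, 2.0])

def F(h):
    # y_j = sin(h_j) + h_1 (arbitrary choice, makes dy_j/dh_1 nonzero for all j)
    return np.sin(h) + h[0]

y = F(h)

# dA/dy_j = 2*y_j for A(y) = sum of y_j^2
dA_dy = 2 * y

# Jacobian J[j, i] = dy_j / dh_i = cos(h_j)*delta_{ji} + delta_{i1}
J = np.diag(np.cos(h))
J[:, 0] += 1.0

# Chain rule: dA/dh_i = sum_j (dA/dy_j)(dy_j/dh_i), i.e. grad_h = J^T @ dA_dy
grad_h = J.T @ dA_dy

# Sanity check against central finite differences
eps = 1e-6
fd = np.array([
    (np.sum(F(h + eps * np.eye(k)[i])**2) - np.sum(F(h - eps * np.eye(k)[i])**2)) / (2 * eps)
    for i in range(k)
])
assert np.allclose(grad_h, fd, atol=1e-5)
```

In backpropagation terms, `J.T @ dA_dy` is the vector-Jacobian product: the gradient with respect to $y$ is "pulled back" through $F$ to give the gradient with respect to $h$.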

In your notation, the $y_i$ are the components of your $F$.

More generally, you could have $k$ of the $y$-variables and $n$ of the $h$-variables as I suggested in the comment; there's no necessity to have $k=n$. Then the gradient would have $n$ components $\tfrac{\partial A}{\partial h_i}$ (so $i=1,\ldots, n$), each consisting of a sum of $k$ terms of the form $\frac{\partial A}{\partial y_j}\frac{\partial y_j}{\partial h_i}$ (so $j=1,\ldots,k$).
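The general rectangular case can be sketched the same way; the following toy example (a linear $F$, chosen here purely for illustration) shows that the Jacobian has shape $k \times n$ and the gradient has $n$ components, each a sum of $k$ terms:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 6  # k y-variables, n h-variables; no need for k == n

# Simplest illustration: linear F, y = W @ h, so dy_j/dh_i = W[j, i]
W = rng.standard_normal((k, n))
h = rng.standard_normal(n)
y = W @ h

# Take A(y) = sum_j y_j, so dA/dy_j = 1 for every j
dA_dy = np.ones(k)

# Each of the n gradient components is a sum of k terms:
# dA/dh_i = sum_j (dA/dy_j) * (dy_j/dh_i)
grad_h = np.array([np.dot(dA_dy, W[:, i]) for i in range(n)])

assert grad_h.shape == (n,)
# Equivalent vectorized form of the same chain rule
assert np.allclose(grad_h, W.T @ dA_dy)
```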