I am working on backpropagation in machine learning. I cannot understand how exactly the chain rule applies in this specific case:
Loss = Activation(y_vector)
y_vector = F(h_vector)
Basically, when I take the gradient of Loss (a scalar) with respect to h_vector, how exactly does the chain rule work? Note that Activation is a function from $R^k$ to $R$, and that h_vector and y_vector have the same dimension.
Your loss function is the activation function $A$ applied to the variables $y_1,\ldots,y_k$.
Each of these variables is a function of the variables $h_1,\ldots,h_k$.
The gradient is $(\tfrac{\partial A}{\partial h_1},\ldots, \tfrac{\partial A}{\partial h_k})$. The chain rule gives you (for $i=1,\ldots, k$)
$$\frac{\partial A}{\partial h_i}= \frac{\partial A}{\partial y_1}\frac{\partial y_1}{\partial h_i} + \frac{\partial A}{\partial y_2}\frac{\partial y_2}{\partial h_i} + \cdots + \frac{\partial A}{\partial y_k}\frac{\partial y_k}{\partial h_i}.$$
In your notation, the $y_i$ are the components of your $F$.
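As a concrete sketch (the particular choices of $F$ and $A$ here are hypothetical, just to make the sum checkable): take $F(h) = Wh$ for a $k\times k$ matrix $W$, so $\partial y_j/\partial h_i = W_{ji}$, and $A(y) = \sum_j y_j^2$, so $\partial A/\partial y_j = 2y_j$. The chain-rule sum above then collapses to the matrix product $W^\top\,\nabla_y A$, which we can verify against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
W = rng.standard_normal((k, k))   # hypothetical F(h) = W @ h, so dy_j/dh_i = W[j, i]
h = rng.standard_normal(k)

y = W @ h                          # y_vector = F(h_vector)
dA_dy = 2 * y                      # dA/dy_j for A(y) = sum(y**2)

# Chain rule: dA/dh_i = sum_j (dA/dy_j) * (dy_j/dh_i)  ==  W.T @ dA_dy
grad_h = W.T @ dA_dy

# Numerical check via central finite differences
eps = 1e-6
numeric = np.array([
    (np.sum((W @ (h + eps * np.eye(k)[i]))**2)
     - np.sum((W @ (h - eps * np.eye(k)[i]))**2)) / (2 * eps)
    for i in range(k)
])
print(np.allclose(grad_h, numeric, atol=1e-4))  # True
```

In backpropagation libraries this is exactly the vector–Jacobian product: the upstream gradient $\nabla_y A$ is multiplied by the transposed Jacobian of $F$.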
More generally, you could have $k$ of the $y$-variables and $n$ of the $h$-variables as I suggested in the comment; there's no necessity to have $k=n$. Then the gradient would have $n$ components $\tfrac{\partial A}{\partial h_i}$ (so $i=1,\ldots, n$), each consisting of a sum of $k$ terms of the form $\frac{\partial A}{\partial y_j}\frac{\partial y_j}{\partial h_i}$ (so $j=1,\ldots,k$).
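The rectangular case looks the same in code; with the same hypothetical choices ($F(h)=Wh$, $A(y)=\sum_j y_j^2$) but $W$ now $k\times n$, the Jacobian is rectangular and the gradient has $n$ components, each a sum of $k$ terms:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 2, 4                        # k y-variables, n h-variables
W = rng.standard_normal((k, n))    # Jacobian of F(h) = W @ h is k x n
h = rng.standard_normal(n)

y = W @ h                          # y lives in R^k
dA_dy = 2 * y                      # for A(y) = sum(y**2)

grad_h = W.T @ dA_dy               # n components; each entry sums k terms
print(grad_h.shape)                # (4,)
```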