I am trying to understand the chain rule applied to a series of transformations in the context of the backpropagation algorithm for deep learning. Let $x \in \mathbb{R}^K$ and let $A, B$ be real-valued matrices of size $K \times K$. Then consider a network defined as $$y = Ax$$ $$u = \sigma(y)$$ $$v = Bx$$ $$z = A(u * v)$$ $$w = Az$$ $$L = \|w\|^2,$$
where $L$ is regarded as a function of $x$, $A$, and $B$; here $u * v$ denotes the element-wise product, and $\sigma(y)$ is the element-wise application of the sigmoid function to $y$. I want to calculate $\frac{\partial L}{\partial A}$ and $\frac{\partial L}{\partial B}$.
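For concreteness, here is a minimal numpy sketch of the forward pass as I understand it (the dimension $K = 4$ and the random inputs are placeholders of my own choosing):

```python
import numpy as np

def sigmoid(t):
    # element-wise sigmoid
    return 1.0 / (1.0 + np.exp(-t))

K = 4                     # placeholder dimension
rng = np.random.default_rng(0)
x = rng.standard_normal(K)
A = rng.standard_normal((K, K))
B = rng.standard_normal((K, K))

y = A @ x                 # y = Ax
u = sigmoid(y)            # u = sigma(y), element-wise
v = B @ x                 # v = Bx
z = A @ (u * v)           # u * v is the element-wise product
w = A @ z                 # w = Az
L = w @ w                 # L = ||w||^2
```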
From what I understand, $\frac{\partial L}{\partial A} = \frac{\partial L}{\partial w} \frac{\partial w}{\partial A}$.
I'm not sure how to express $\frac{\partial w}{\partial A}$, since $z$ is itself a function of $A$. My guess would be something like $\frac{\partial w}{\partial A} = \frac{dA}{dA}\, z + A \,\frac{dz}{dA}$, but I am not sure whether this step calls for the product rule or the chain rule.
I'm also not sure how to express $\frac{\partial z}{\partial A}$. Any insights appreciated.
The first thing to do is to draw the underlying computation graph correctly, and then apply the chain rule according to that graph. In this network, $A$ feeds directly into three nodes ($y = Ax$, $z = A(u*v)$, and $w = Az$), while $B$ feeds only into $v = Bx$.
The following is the chain rule that you should remember: for any node $a$ in the graph, $$\frac{dL}{da} = \sum_{c \,\in\, \mathrm{children}(a)} \frac{dL}{dc}\,\frac{\partial c}{\partial a},$$ where the sum runs over the children of $a$, i.e. the nodes $c$ that depend directly on $a$.
Therefore, the chain rule applied to node $A$ gives $$\frac{dL}{dA} = \frac{dL}{dw}\frac{\partial w}{\partial A} + \frac{dL}{dz}\frac{\partial z}{\partial A} + \frac{dL}{dy}\frac{\partial y}{\partial A}.$$
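To make those three terms concrete, under one common convention (column vectors, with $\frac{dL}{dA}$ arranged to have the same shape as $A$), each term is an outer product of an upstream gradient with the vector that $A$ multiplies at that node: $$\frac{dL}{dA} = \frac{dL}{dw}\, z^\top + \frac{dL}{dz}\,(u*v)^\top + \frac{dL}{dy}\, x^\top, \qquad \frac{dL}{dw} = 2w.$$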
The only unknown quantities above are $\frac{dL}{dz}$ and $\frac{dL}{dy}$, which can be computed by applying the same chain rule to the nodes $z$ and $y$, respectively. This is precisely how backpropagation works.
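Spelled out for this particular graph, every node between $L$ and the leaves has a single child, so each chain-rule step has one term. Writing $s = u*v$ for the product node (my notation), and using $\sigma'(y) = u*(1-u)$ for the sigmoid, one gets $$\frac{dL}{dz} = A^\top \frac{dL}{dw}, \qquad \frac{dL}{ds} = A^\top \frac{dL}{dz}, \qquad \frac{dL}{du} = \frac{dL}{ds} * v, \qquad \frac{dL}{dy} = \frac{dL}{du} * u * (1-u).$$ Similarly, since $B$'s only child is $v = Bx$, $$\frac{dL}{dB} = \frac{dL}{dv}\, x^\top = \left(\frac{dL}{ds} * u\right) x^\top.$$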
Check my answer here for a more detailed explanation: https://math.stackexchange.com/a/3865685/31498. You should be able to fully understand backpropagation after reading that.
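As a sanity check (my own sketch, not part of the linked answer), the recursions above can be implemented in numpy and compared against finite differences:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(A, B, x):
    u = sigmoid(A @ x)
    z = A @ (u * (B @ x))
    w = A @ z
    return w @ w

K = 4
rng = np.random.default_rng(1)
x = rng.standard_normal(K)
A = rng.standard_normal((K, K))
B = rng.standard_normal((K, K))

# Forward pass, keeping every node of the graph.
y = A @ x
u = sigmoid(y)
v = B @ x
s = u * v            # the element-wise product node
z = A @ s
w = A @ z

# Backward pass: one chain-rule step per node.
dL_dw = 2 * w
dL_dz = A.T @ dL_dw
dL_ds = A.T @ dL_dz
dL_dy = (dL_ds * v) * u * (1 - u)

# dL/dA collects one outer product per occurrence of A; dL/dB has one.
dL_dA = np.outer(dL_dw, z) + np.outer(dL_dz, s) + np.outer(dL_dy, x)
dL_dB = np.outer(dL_ds * u, x)

# Finite-difference check on one entry of each matrix.
eps = 1e-6
A2 = A.copy(); A2[0, 1] += eps
B2 = B.copy(); B2[0, 1] += eps
print((loss(A2, B, x) - loss(A, B, x)) / eps, dL_dA[0, 1])  # should agree
print((loss(A, B2, x) - loss(A, B, x)) / eps, dL_dB[0, 1])  # should agree
```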