I wanted to check my derivation logic, as I see so many places online doing it slightly differently. So I will show my derivation and put my questions at the end:
where the vector $z$ has components indexed $0$ through $n$, and $S$ is the softmax function.
\begin{eqnarray*} {z} = \left(\begin{array}{c} z_0\\ z_1\\ \vdots\\ z_n \end{array}\right) & & {a} = S ({z}) = \left(\begin{array}{c} S (z_0)\\ S (z_1)\\ \vdots\\ S (z_n) \end{array}\right) = \left(\begin{array}{c} \dfrac{e^{z_0 }}{{\sum^n_{j = 0}} e^{z_j }}\\ \dfrac{e^{z_1 }}{{\sum^n_{j = 0}} e^{z_j }}\\ \vdots\\ \dfrac{e^{z_n }}{{\sum^n_{j = 0}} e^{z_j }} \end{array}\right) = \left(\begin{array}{c} a_0\\ a_1\\ \vdots\\ a_n \end{array}\right) \end{eqnarray*}
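To make the setup concrete, here is a minimal numerical sketch of the softmax definition above (NumPy; the max-subtraction is my addition for numerical stability, not part of the derivation):

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability;
    # softmax is unchanged by adding a constant to every component.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
a = softmax(z)
# The components of a are positive and sum to 1, as the definition requires.
```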
I want the derivative $da/dz$. So I can take the derivative of each element.
\begin{eqnarray*} & \dfrac{d {a} }{d {z} } = \left(\begin{array}{c} \dfrac{d a_0 }{d {z} }\\ \dfrac{d a_1 }{d {z} }\\ \vdots\\ \dfrac{d a_n }{d {z} } \end{array}\right) & \end{eqnarray*}
For element $j$ in $da/dz$ I have:
$$\dfrac{d a_j}{d z} = \dfrac{d S(z_j)}{d z} = \dfrac{d \left( \dfrac{e^{z_j}}{e^{z_0} + e^{z_1} + \cdots + e^{z_n}} \right)}{d z}$$
Since $a_j$ depends on multiple independent variables $(z_0, z_1, \ldots, z_n)$, I take the total derivative as the sum of all the partial derivatives, where $k$ indexes the component of $z$ with respect to which each partial derivative is taken.
\begin{eqnarray*} & \dfrac{d a_j}{d z} = \sum^n_{k = 0} \dfrac{d a_j}{d z_k} = \dfrac{d a_j}{d z_0} + \dfrac{d a_j}{d z_1} + \ldots + \dfrac{d a_j}{d z_n} & \end{eqnarray*} So now I have: \begin{eqnarray*} & \dfrac{d a}{d z} = \left(\begin{array}{c} \dfrac{d a_0}{d z}\\ \dfrac{d a_1}{d z}\\ \vdots\\ \dfrac{d a_n}{d z} \end{array}\right) = \left(\begin{array}{c} \dfrac{d a_0}{d z_0} + \dfrac{d a_0}{d z_1} + \ldots + \dfrac{d a_0}{d z_n}\\ \dfrac{d a_1}{d z_0} + \dfrac{d a_1}{d z_1} + \ldots + \dfrac{d a_1}{d z_n}\\ \vdots\\ \dfrac{d a_n}{d z_0} + \dfrac{d a_n}{d z_1} + \ldots + \dfrac{d a_n}{d z_n} \end{array}\right) & \end{eqnarray*}
From here I am able to derive all the partial derivatives and get the final derivative. I will skip the full working here, as I am confident it agrees with what I have found online.
when $k=j$: $$\dfrac{d a_j}{d z_k} = \dfrac{d a_j}{d z_j} = a_j (1 - a_j)$$
when $k \neq j$: $$\dfrac{d a_j}{d z_k} = - a_j a_k$$
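These two cases can be checked numerically against a central finite difference (a sketch; the test point `z` is an arbitrary choice of mine):

```python
import numpy as np

def softmax(z):
    # Max-subtraction for numerical stability; does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])  # arbitrary test point
a = softmax(z)
eps = 1e-6
max_err = 0.0
for j in range(len(z)):
    for k in range(len(z)):
        # Central difference approximation of d a_j / d z_k.
        zp, zm = z.copy(), z.copy()
        zp[k] += eps
        zm[k] -= eps
        numeric = (softmax(zp)[j] - softmax(zm)[j]) / (2 * eps)
        # Closed forms from the two cases above.
        analytic = a[j] * (1 - a[j]) if j == k else -a[j] * a[k]
        max_err = max(max_err, abs(numeric - analytic))
```

If the two cases are right, `max_err` should be tiny (limited only by the finite-difference approximation).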
This gives the final answer:
$$\dfrac{d a}{d z} = \left(\begin{array}{c} a_0 (1 - a_0) - a_0 a_1 - \ldots - a_0 a_n\\ - a_1 a_0 + a_1 (1 - a_1) - \ldots - a_1 a_n\\ \vdots\\ - a_n a_0 - a_n a_1 - \ldots + a_n (1 - a_n) \end{array}\right)$$
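The vector above is exactly the row sums of the matrix of all partial derivatives, which can be built and summed directly (a sketch; `softmax_jacobian` is my name for the helper, not standard API):

```python
import numpy as np

def softmax(z):
    # Max-subtraction for numerical stability; does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(a):
    # J[j, k] = a_j * (delta_jk - a_k): a_j(1 - a_j) on the diagonal,
    # -a_j a_k off the diagonal, matching the two cases above.
    return np.diag(a) - np.outer(a, a)

z = np.array([1.0, 2.0, 3.0])
a = softmax(z)
J = softmax_jacobian(a)
row_sums = J.sum(axis=1)  # one entry per component, as in the vector above
```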
My questions:
Is it correct to take $da/dz$ element-wise as $da_j/dz$? And is it then correct to compute $da_j/dz$ as the sum of partial derivatives $da_j/dz_0 + da_j/dz_1 + \ldots + da_j/dz_n$? I see other sources doing this differently or skipping steps, so I am not sure.
Why do so many places online compute the Jacobian matrix for this? I understand that the Jacobian matrix contains all the partial derivatives, but as far as I can see the actual derivative is the sum of the rows of the Jacobian matrix, and I think presenting the matrix only confuses matters.
Often I see it said online that the Jacobian matrix IS the derivative $da/dz$. Is that correct? If so, then what have I computed here? And how can I use a matrix for backpropagation? I thought it had to be a vector.
Thank you for any help; I think I'm close to completely understanding this derivation and hope to implement it in my neural network.