I am teaching myself AI / machine learning without libraries.
I understand most of the derivative of the softmax activation function.
If I have 3 nodes in a layer, then the softmax activation for node i becomes

s_i = e^x_i / (e^x_1 + e^x_2 + e^x_3)
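To make that concrete, here is my sketch of the forward pass in plain Python (no libraries; the function and variable names are just my own):

```python
import math

def softmax(xs):
    """Softmax over a list of raw node inputs x_i."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))  # three outputs that sum to 1
```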
and its derivative gives each node a vector of partial derivatives that can be described (writing s_i for the softmax output of node i, not the raw e^x_i) as

node 1 = {s_1(1 - s_1), s_1(0 - s_2), s_1(0 - s_3)}
node 2 = {s_2(0 - s_1), s_2(1 - s_2), s_2(0 - s_3)}
node 3 = {s_3(0 - s_1), s_3(0 - s_2), s_3(1 - s_3)}
which is a real result described by the equation

softmaxActivation(i) * (kroneckerDelta(i, j) - softmaxActivation(j))

where kroneckerDelta(i, j) is 1 when i = j and 0 otherwise,
which means that in a layer containing 3 nodes, each node's derivative is made up of 3 partial derivatives (one row of a 3x3 Jacobian matrix).
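Written as code (still plain Python; `softmax_jacobian` is a name I made up for this sketch of the equation above):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(s):
    """Row i, column j holds s_i * (kronecker(i, j) - s_j)."""
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

s = softmax([1.0, 2.0, 3.0])
J = softmax_jacobian(s)  # 3x3 matrix; row i is the derivative vector for node i
```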
How do I get that back to a scalar? Is it true that I need a scalar per node for backpropagation? Can I use the directional derivative (a new concept to me), which can be described as gradient * vector?

Or, in a slightly more tangible description, the directional derivative for node 1 would be, for some vector v = (v_1, v_2, v_3):

s_1(1 - s_1) * v_1 + s_1(0 - s_2) * v_2 + s_1(0 - s_3) * v_3
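Here is my sketch of what I think that collapsing step would look like in code (the `upstream` gradient values are made up purely for illustration, and `vjp` is my own name for the helper):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(s):
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

def vjp(upstream, J):
    """Collapse the Jacobian against an upstream gradient vector:
    dL/dx_j = sum_i upstream_i * J[i][j] -- one scalar per input node."""
    n = len(J)
    return [sum(upstream[i] * J[i][j] for i in range(n)) for j in range(n)]

s = softmax([1.0, 2.0, 3.0])
J = softmax_jacobian(s)
upstream = [0.5, -0.2, 0.1]  # made-up dL/ds values from the next layer
print(vjp(upstream, J))      # one scalar gradient per node input
```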
I'm teaching myself, so any guidance would be appreciated. Thank you!