I am studying Word2Vec and in the output of the Neural Network, I get a vector $\hat{y}$ that is a $1 \times V$ vector where $V$ is the size of the vocabulary and $\hat{y}_i$ is the predicted probability that the $i$-th word in the vocabulary is the next word given the input word.
For the loss function, I am using: $$-\ln(\hat{y}_{j^*})$$ where $j^*$ is the vocabulary index of the actual next word. That is, $\hat{y}_{j^*}$ should be $1$, so I'm using this expression to calculate the loss. Then, for gradient descent, I need to take the partial derivative of the loss function with respect to $W_2$ and $W_1$, which are the weight matrices.
I'm using the softmax function in the output layer, which means that $$\hat{y} = \mathrm{softmax}(u)$$ where $u$ is also a $1 \times V$ vector. So, for the $W_2$ derivative we get: $$\frac{\partial (-\ln(\hat{y}_{j^*}))}{\partial W_2} = \frac{\partial (-\ln(\mathrm{softmax}(u_{j^*})))}{\partial W_2}$$
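To make sure my setup is right, here is a toy sketch of this forward step as I understand it (NumPy, with made-up numbers; the variable names are just mine):

```python
import numpy as np

def softmax(u):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(u - np.max(u))
    return e / e.sum()

V = 5                                       # toy vocabulary size
u = np.array([0.5, -1.2, 0.3, 2.0, 0.1])    # scores, shape (V,)
y_hat = softmax(u)                          # predicted probabilities
j_star = 3                                  # index of the true next word
loss = -np.log(y_hat[j_star])               # the loss I defined above
```

Since $\hat{y}_{j^*} < 1$, the loss is always positive, and it shrinks as the model puts more probability on the true word.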
$$\frac{\partial (-\ln(\mathrm{softmax}(u_{j^*})))}{\partial W_2} = \frac{\partial \left[-\ln\left(\frac{e^{u_{j^*}}}{\sum_i e^{u_i}}\right)\right]}{\partial W_2} = \frac{\partial \left[-\ln(e^{u_{j^*}}) + \ln\left(\sum_i e^{u_i}\right)\right]}{\partial W_2}$$ $$= - \frac{\partial \left[\ln(e^{u_{j^*}})\right]}{\partial W_2} + \frac{\partial \left[\ln\left(\sum_i e^{u_i}\right)\right]}{\partial W_2}$$
$$= - \frac{\partial [u_{j^*}]}{\partial W_2} + \frac{\partial \left[\ln\left(\sum_i e^{u_i}\right)\right]}{\partial W_2} = - \frac{\partial [u_{j^*}]}{\partial W_2} + \frac{\partial \left[\ln\left(\sum_i e^{u_i}\right)\right]}{\partial \sum_i e^{u_i}} \cdot \frac{\partial \left(\sum_i e^{u_i}\right)}{\partial W_2}$$
$$= - \frac{\partial [u_{j^*}]}{\partial W_2} + \frac{1}{\sum_i e^{u_i}} \cdot \left(\sum_i \frac{\partial [e^{u_i}]}{\partial W_2}\right)$$
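As a sanity check on this derivation, I wrote a finite-difference test of the derivative with respect to $u$ itself, comparing against the closed form I've seen stated elsewhere (that $\partial L / \partial u$ equals $\hat{y}$ minus a one-hot vector at $j^*$). This is my own toy code, not from any reference:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

rng = np.random.default_rng(2)
V, j_star = 5, 1
u = rng.standard_normal(V)

# numerical gradient of -ln(softmax(u)[j_star]) w.r.t. u,
# by central finite differences
eps = 1e-6
numeric = np.zeros(V)
for i in range(V):
    up, um = u.copy(), u.copy()
    up[i] += eps
    um[i] -= eps
    numeric[i] = (-np.log(softmax(up)[j_star])
                  + np.log(softmax(um)[j_star])) / (2 * eps)

# closed form stated in the literature: softmax(u) minus one-hot at j_star
analytic = softmax(u)
analytic[j_star] -= 1.0
assert np.allclose(numeric, analytic, atol=1e-6)
```

The two agree for me, so the scalar part of the derivation seems right; my question is about pushing it through to $W_2$.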
But $$u = h \cdot W_2$$ where $h$ is the $1 \times N$ vector of the hidden layer, $N$ is the size of the hidden layer, and $W_2$ is the $N \times V$ weight matrix.
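In code, the shapes look like this for me (toy sizes; each $u_j$ is the dot product of $h$ with column $j$ of $W_2$):

```python
import numpy as np

N, V = 4, 5                          # toy hidden and vocabulary sizes
rng = np.random.default_rng(0)
h = rng.standard_normal(N)           # hidden layer, shape (N,)
W2 = rng.standard_normal((N, V))     # weight matrix, shape (N, V)
u = h @ W2                           # scores, shape (V,)

# each entry of u is h dotted with the corresponding column of W2
assert np.allclose(u[2], h @ W2[:, 2])
```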
Let's take a look at this first part: $$\frac{\partial [u_{j^*}]}{\partial W_2}$$
We have that $u_{j^*} = h \cdot W_{2_{(:, j^*)}}$. Substituting:
$$\frac{\partial [u_{j^*}]}{\partial W_2} = \frac{\partial \left[h \cdot W_{2_{(:, j^*)}}\right]}{\partial W_2}$$
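To build some intuition before asking, I also checked numerically what this object looks like: since $u_{j^*} = \sum_n h_n W_{2_{(n, j^*)}}$, I expect the derivative of the scalar $u_{j^*}$ with respect to the full matrix $W_2$ to be an $N \times V$ matrix that is zero everywhere except column $j^*$, which equals $h$. My own finite-difference check (assumed correct, please tell me if not):

```python
import numpy as np

rng = np.random.default_rng(1)
N, V, j_star = 4, 5, 2
h = rng.standard_normal(N)
W2 = rng.standard_normal((N, V))

# my guess: d u_{j*} / d W2 is zero everywhere except
# column j*, which equals h
analytic = np.zeros((N, V))
analytic[:, j_star] = h

# check entry by entry with central finite differences
eps = 1e-6
numeric = np.zeros((N, V))
for n in range(N):
    for k in range(V):
        Wp, Wm = W2.copy(), W2.copy()
        Wp[n, k] += eps
        Wm[n, k] -= eps
        numeric[n, k] = ((h @ Wp)[j_star] - (h @ Wm)[j_star]) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```

So numerically my guess seems to hold, but I don't know how to write this step down properly in matrix calculus notation.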
So, is this correct? How can I proceed from here? How do I differentiate a column of a matrix with respect to the matrix itself in this context?
I'm new to Neural Networks and Machine Learning, and I will have Vector Calculus classes this semester, so please forgive me for any mistakes, and for my bad English as well.