To train a word2vec model (vector representations of word meanings), the following expression needs to be optimized:
$$p(\text{context word} \mid \text{current word}) = \frac{\exp(u_o^\top v_c)}{\sum_{w = 1}^V\exp(u_w^\top v_c)}$$
where $u_o$ is the vector of the observed context word, $u_w$ ranges over the context vectors of all $V$ words in the vocabulary, and $v_c$ is the vector of the current (center) word.
Taking the logarithm, the above expression can be rewritten as $$\log \exp(u_o^\top v_c) - \log \sum_{w = 1}^V\exp(u_w^\top v_c) = u_o^\top v_c - \log \sum_{w = 1}^V\exp(u_w^\top v_c)$$
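To convince myself the split is right, I wrote a quick numeric sanity check with random toy vectors in NumPy (the sizes and the seed are arbitrary); it confirms that the log of the softmax equals $u_o^\top v_c$ minus the log of the normalizing sum:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))      # rows: context ("outside") vectors u_w
v_c = rng.normal(size=d)         # current ("center") word vector
o = 2                            # index of the observed context word

scores = U @ v_c                 # u_w^T v_c for every w
p = np.exp(scores[o]) / np.exp(scores).sum()   # softmax probability

# log p should equal u_o^T v_c - log sum_w exp(u_w^T v_c)
lhs = np.log(p)
rhs = scores[o] - np.log(np.exp(scores).sum())
print(np.isclose(lhs, rhs))     # True
```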
Now the gradient of that function needs to be found to improve the model.
In the source I am consulting finding the gradient starts by finding the derivative of the expression $u_o^\top v_c$ with respect to $v_c$.
\begin{align*} \frac{\partial}{\partial v_c} u_o^\top v_c = u_o \end{align*}
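At least numerically, this claim checks out: approximating each partial derivative of $u_o^\top v_c$ with a central finite difference (random toy vectors, NumPy) recovers $u_o$:

```python
import numpy as np

rng = np.random.default_rng(1)
u_o = rng.normal(size=4)
v_c = rng.normal(size=4)

f = lambda v: u_o @ v            # f(v) = u_o^T v, a scalar

# numerical gradient: perturb each component of v_c in turn
eps = 1e-6
grad = np.array([(f(v_c + eps * e) - f(v_c - eps * e)) / (2 * eps)
                 for e in np.eye(4)])
print(np.allclose(grad, u_o))    # True
```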
I don't understand what is being done there. As far as I know, $\frac{\partial}{\partial v}f$ is one of the many notations for the directional derivative, but that does not seem to be what is meant here. When I searched for "derivative with respect to a vector", I found the claim
(1)
The first derivative of a scalar-valued function $f$ with respect to a vector is called the gradient of $f$
[ https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471705195.app3 ]
That seems to fit the bill, as $p$ is scalar valued.
I also found
(2)
Let $x \in \mathbb{R}^n$ (a column vector) and let $f : \mathbb{R}^n \to \mathbb{R}^m$. The derivative of $f$ with respect to x is the $m \times n$ matrix $\frac{\partial f}{\partial x} = $ [...] . $\frac{\partial f}{\partial x}$ is called the Jacobian matrix of $f$.
[ http://www.cs.huji.ac.il/~csip/tirgul3_derivatives.pdf ]
That seems to be the generalization of (1).
(3)
$\frac{d f}{d v} = \left[ \frac{d f}{d v_1}, \dots, \frac{d f}{d v_n}\right]^\top$ where $v \in \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$
[ https://www.youtube.com/watch?v=iWxY7VdcSH8 ]
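As far as I can tell, (1) and (3) are the $m = 1$ case of (2): for a scalar-valued $f$ the Jacobian is a $1 \times n$ row whose transpose is the gradient. Here is a small check of that on a hypothetical example function $f(x) = x_1^2 + x_2 x_3$ (hand-computed Jacobian $[2x_1, x_3, x_2]$ versus finite differences):

```python
import numpy as np

def f(x):                         # scalar-valued: R^3 -> R
    return x[0]**2 + x[1] * x[2]

def jacobian(x):                  # 1x3 Jacobian, computed by hand
    return np.array([[2 * x[0], x[2], x[1]]])

x = np.array([1.0, 2.0, 3.0])

# numerical 1xn Jacobian: one central difference per coordinate
eps = 1e-6
num = np.array([[(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)]])
print(np.allclose(num, jacobian(x)))   # True
```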
I am still somewhat lost. Can you tell me
1) which, if any, of the above cases is applied here,
2) why (1) is actually true (my primary question, if that is indeed what happens here), and
3) what I need to look for to learn more about derivatives with respect to vectors, as I did not find much more than the above three claims?
Also, I am a little confused by his use of the chain rule in the next steps of the derivation. He rewrites
$$\frac{\partial}{\partial v_c} \log \sum_{w=1}^V \exp(u_w^\top v_c)$$
as
$$\frac{1}{\sum_{w=1}^V \exp(u_w^\top v_c)} \frac{\partial}{\partial v_c} \sum_{w=1}^V \exp(u_w^\top v_c) $$
Why can he do that? According to (3) I would have expected
\begin{align*} \frac{\partial}{\partial v_c} \log \sum_{w=1}^V \exp(u_w^\top v_c) &= \begin{bmatrix} \frac{\partial}{\partial v_{c_1}} \log \sum_{w=1}^V \exp(u_w^\top v_c)\\ \vdots\\ \frac{\partial}{\partial v_{c_n}} \log \sum_{w=1}^V \exp(u_w^\top v_c) \end{bmatrix}\\ &= \begin{bmatrix} \frac{1}{\sum_{w=1}^V \exp(u_w^\top v_c)} \frac{\partial}{\partial v_{c_1}} \sum_{w=1}^V \exp(u_w^\top v_c)\\ \vdots\\ \frac{1}{\sum_{w=1}^V \exp(u_w^\top v_c)} \frac{\partial}{\partial v_{c_n}} \sum_{w=1}^V \exp(u_w^\top v_c) \end{bmatrix} \end{align*}
which, if I am not mistaken, yields the same final result as in the video, but I would like to know whether his shortcut has a general form with a proof.
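At least the two routes agree numerically: below I compare his "vector chain rule" shortcut against the component-by-component definition in (3), approximated with finite differences (random toy vectors, NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 6, 4
U = rng.normal(size=(V, d))      # rows: context vectors u_w
v_c = rng.normal(size=d)

f = lambda v: np.log(np.exp(U @ v).sum())   # log sum_w exp(u_w^T v)

# the shortcut: scalar chain rule applied to the whole vector at once,
# giving (1 / sum_w exp(u_w^T v_c)) * sum_w exp(u_w^T v_c) u_w
e = np.exp(U @ v_c)
shortcut = (1.0 / e.sum()) * (e @ U)

# the definition in (3): one partial derivative per component of v_c,
# each approximated by a central finite difference
eps = 1e-6
numeric = np.array([(f(v_c + eps * b) - f(v_c - eps * b)) / (2 * eps)
                    for b in np.eye(d)])
print(np.allclose(shortcut, numeric))   # True
```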