Gradient of a log softmax layer: handling the cases


I am doing a multi-part homework assignment about differentiating neural networks. The first part asked me to derive the derivative of the log softmax, indexed at output dimension $w$, with respect to one of the inputs $y_i$.

The log softmax indexed at output dimension $w$ was written as:

$\log([S(y)]_w)$, where $[S(y)]_w = \frac{e^{y_w}}{\sum_j e^{y_j}}$

I found:

$\frac{\partial \log([S(y)]_w)}{\partial y_i} = \begin{cases} -[S(y)]_i & w\neq i \\ 1 - [S(y)]_i & w = i \end{cases}$
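One quick way to sanity-check the case expression is a finite-difference test. Below is a sketch in NumPy (the variable names and test setup are mine, not part of the assignment):

```python
import numpy as np

def log_softmax(y):
    # numerically stable: log [S(y)]_w = y_w - log(sum_j exp(y_j))
    y = y - y.max()
    return y - np.log(np.exp(y).sum())

rng = np.random.default_rng(0)
y = rng.standard_normal(5)
w = 2

# analytic gradient from the cases: -[S(y)]_i for w != i, 1 - [S(y)]_i for w = i
S = np.exp(y - y.max())
S /= S.sum()
analytic = -S.copy()
analytic[w] += 1.0

# central finite differences, one input at a time
eps = 1e-6
numeric = np.empty_like(y)
for i in range(y.size):
    yp, ym = y.copy(), y.copy()
    yp[i] += eps
    ym[i] -= eps
    numeric[i] = (log_softmax(yp)[w] - log_softmax(ym)[w]) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

If the two disagree, a sign slip in the cases is the usual suspect.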

Then I was asked to find:

$\frac{\partial}{\partial B} log(p_w)$ where $p_w = [S(Bh)]_w$

So I did the following:

$=\frac{\partial}{\partial B} \log([S(Bh)]_w)$ substitution

$=\frac{\partial}{\partial Bh}\log([S(Bh)]_w)\frac{\partial}{\partial B} Bh $ chain rule

I calculated: $\frac{\partial}{\partial B} Bh = h$

Now for my trouble: I have already calculated $\frac{\partial}{\partial y_i}\log([S(y)]_w)$, so I should be able to reuse it for $\frac{\partial}{\partial Bh}\log([S(Bh)]_w)$. But my result has cases. How can I handle these? Have I made an error? Also, I need the vector result to be a row vector so that multiplying by $h$ gives the right dimensionality in my answer. Do I have to transpose it?

Edit*

I found a one-hot vector notation that might work, with $e_w$ the one-hot vector for index $w$. Would it be OK to represent it like so?

$\frac{\partial}{\partial Bh}\log([S(Bh)]_w) = e_w - S(Bh)$

Edit**

So my final answer is:

$\frac{\partial}{\partial B} \log(p_w) = (e_w - S(Bh))\,h^T$
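Both the outer-product form and the transpose question can be checked numerically: compare $(e_w - S(Bh))\,h^T$, with $S(Bh)$ the full softmax vector, against finite differences over every entry of $B$. Another NumPy sketch (the shapes of $B$ and $h$ are arbitrary choices of mine):

```python
import numpy as np

def log_softmax(y):
    # numerically stable log softmax
    y = y - y.max()
    return y - np.log(np.exp(y).sum())

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 3))  # maps h (3,) to logits y = Bh (4,)
h = rng.standard_normal(3)
w = 1

# analytic gradient: (e_w - S(Bh)) h^T, an outer product with the shape of B
y = B @ h
S = np.exp(y - y.max())
S /= S.sum()
e_w = np.zeros_like(S)
e_w[w] = 1.0
analytic = np.outer(e_w - S, h)

# central finite differences over each entry B[a, b]
eps = 1e-6
numeric = np.zeros_like(B)
for a in range(B.shape[0]):
    for b in range(B.shape[1]):
        Bp, Bm = B.copy(), B.copy()
        Bp[a, b] += eps
        Bm[a, b] -= eps
        numeric[a, b] = (log_softmax(Bp @ h)[w] - log_softmax(Bm @ h)[w]) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

The column vector $e_w - S(Bh)$ times the row vector $h^T$ yields a matrix with the same shape as $B$, which answers the transpose question: $h$ enters as $h^T$ on the right.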

Is this notation correct?