What is the derivative of the Softmax function AFTER subtracting the maximum value from each input?

I'm using the Softmax function as the activation function for the last layer of a neural network I am trying to code up. The function takes in a vector of elements, $\vec{z}$, of length $L$, and returns the activation values of the neurons as a vector $\vec{a}$, also of length $L$. The function is:

\begin{equation} a_j = \dfrac{e^{z_j}}{\sum_{k=1}^{L}e^{z_k}} \end{equation}
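For concreteness, here is a minimal NumPy sketch of the function as I understand it (the function name `softmax` is my own):

```python
import numpy as np

def softmax(z):
    """Plain Softmax: a_j = exp(z_j) / sum_k exp(z_k)."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([1.0, 2.0, 3.0])
a = softmax(z)
print(a)        # each a_j lies in (0, 1)
print(a.sum())  # the components sum to 1.0
```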

$a_1$ would be the first element in $\vec{a}$, then $a_2$, and so on until the last element, $a_L$. The values of $\vec{z}$ are exponentiated so as to map them into the range $(0,\infty)$. Dividing each exponentiated element by the sum of all the exponentiated elements then yields values in the range $(0,1)$.

Since $a_j$ is dependent on all the values of $\vec{z}$ -- the summation in the denominator includes all values of $\vec{z}$ -- we could think of $a_j$ as a function of each of the elements in $\vec{z}$, i.e. $a_j(z_1, z_2, z_3, ..., z_L)$. With this we can take a partial derivative with respect to one of the elements of $\vec{z}$, i.e. $\dfrac{\partial a_j}{\partial z_i}$ where $z_i$ is an arbitrary element of $\vec{z}$.

It can be shown that this derivative depends on whether $i=j$. This page describes the derivation and shows that:

\begin{equation} \dfrac{\partial a_j}{\partial z_i} = a_j \left( \delta_{ij} - a_i\right) \end{equation} where \begin{equation} \delta_{ij} = \left\{ \begin{array}{lr} 1, & \text{if } i = j\\ 0, & \text{if } i\ne j \end{array} \right. \end{equation}
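As a sanity check, this formula can be compared against a finite-difference approximation; a short sketch (the function names are my own):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_jacobian(z):
    """J[i, j] = da_j/dz_i = a_j * (delta_ij - a_i)."""
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

# Central-difference check of each row da/dz_i
z = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(z)
eps = 1e-6
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
    assert np.allclose(J[i], numeric, atol=1e-6)
```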

My question is about subtracting the maximum of the vector before applying the Softmax function. If $\max(\vec{z}) = \hat{z}$, then the Softmax function would look like:

\begin{equation} a_j = \dfrac{e^{z_j - \hat{z}}}{\sum_{k=1}^{L}e^{z_k-\hat{z}}} \end{equation}

The reasoning behind doing this is to eliminate the possibility of an overflow error when exponentiating $z_j$. By subtracting the largest element of the vector from each element, the largest value becomes $0$ and all other elements become negative. When exponentiated, this produces values in the range $(0,1]$.
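To illustrate the overflow issue, here is a small sketch contrasting the two versions (again, the function names are my own):

```python
import numpy as np

def softmax_naive(z):
    exp_z = np.exp(z)          # overflows for large entries of z
    return exp_z / exp_z.sum()

def softmax_stable(z):
    shifted = z - z.max()      # largest entry becomes 0
    exp_z = np.exp(shifted)    # all values now lie in (0, 1]
    return exp_z / exp_z.sum()

z = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(z))   # exp(1000) overflows to inf -> [nan nan nan]
print(softmax_stable(z))      # finite, well-defined values
```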

None of the resources I found show that this subtraction preserves the derivative. I tried to derive this myself but got stuck when noticing that $\hat{z}$ is a function of the elements of $\vec{z}$, i.e. $\hat{z} = \max(z_1, z_2, z_3, ..., z_L)$. This, in turn, allows a partial derivative to be taken, such as $\dfrac{\partial \hat{z}}{\partial z_i}$. I would like to understand how the original partial derivative $\dfrac{\partial a_j}{\partial z_i}$ is derived when subtracting the maximum value of the vector from each element.

One thing that could maybe work is the chain rule. If we let $\overset{*}{z_i} = z_i-\hat{z}$, we could write:

\begin{equation} \dfrac{\partial a_j}{\partial z_i} = \dfrac{\partial a_j}{\partial \overset{*}{z_i}} \dfrac{\partial \overset{*}{z_i}}{\partial z_i} \end{equation}

I found this last chain rule tricky to evaluate, and it might not even be a correct way of approaching this change to the Softmax derivative.
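Part of my uncertainty is that each $\overset{*}{z_k}$ depends on every $z_i$ through $\hat{z}$, so I suspect the full multivariable chain rule would actually require a sum over all of the intermediate variables, something like:

\begin{equation} \dfrac{\partial a_j}{\partial z_i} = \sum_{k=1}^{L} \dfrac{\partial a_j}{\partial \overset{*}{z_k}} \dfrac{\partial \overset{*}{z_k}}{\partial z_i} \end{equation}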