softmax binary classification and monotonically decreasing substitution function


In this paper, and later in this paper, the authors explain a modification to the original softmax function. If the original softmax function is: $$\mbox{soft}(i) = \frac{e^{w_{y_i}^T\cdot \ x_i}}{\sum_je^{w_{y_j}^T \cdot \ x_i}}$$

then another representation will be:

$$\mbox{soft}(i) = \frac{e^{\|w_{y_i}\|\|x_i\|\cos(\theta_{yi})}}{\sum_je^{\|w_{y_j}\|\|x_i\|\cos(\theta_{yj})}}$$

So, if the input feature $x$ belongs to class 1, we require: $$\|w_1\|\|x\|\cos(\theta_1) > \|w_2\|\|x\|\cos(\theta_2)\qquad \text{where}\quad 0 < \theta_i < \pi$$

To make the classification harder, the authors propose to enforce: $$\|w_1\|\|x\|\cos(m\theta_1) > \|w_2\|\|x\|\cos(\theta_2)\qquad \text{where}\quad 0 < \theta_1 < \frac{\pi}{m}$$ and $m$ is a positive integer.

They then state that, because the following inequality holds: $$\|w_1\|\|x\|\cos(\theta_1) \geq\|w_1\|\|x\|\cos(m\theta_1) > \|w_2\|\|x\|\cos(\theta_2)$$ the original condition $\|w_1\|\|x\|\cos(\theta_1) > \|w_2\|\|x\|\cos(\theta_2)$ must also hold.

So they rewrite the softmax function using a function $\psi$: $$\mbox{soft}(i) = \frac{e^{\|w_{y_i}\|\|x_i\|\psi(\theta_{yi})}}{e^{\|w_{y_i}\|\|x_i\|\psi(\theta_{yi})} + \sum_{y_j\ne y_i}e^{\|w_{y_j}\|\|x_i\|\cos(\theta_{yj})}}$$

and finally they define $\psi$ as:

$$\psi(\theta) = \begin{cases} \cos(m\theta), & 0\le \theta \le \pi/m \\ D(\theta), & \pi/m< \theta \le \pi \end{cases}$$

So here are my questions:

    1. Why do they require $D(\theta)$, or in general $\psi(\theta)$, to be monotonically decreasing?
    2. Why do they have to define another function, i.e. $D(\theta)$, on the range $\pi/m< \theta \le \pi$? Does it have anything to do with the gradient descent optimization?

There is 1 best solution below


On $0 \leq \theta \leq \pi / m$, $\psi(\theta)$ is monotonically decreasing because it equals $\cos (m\theta )$, which is monotonically decreasing on that interval.
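As a quick check of this claim (a worked step added here for illustration, not quoted from either paper), differentiating gives:

$$\frac{d}{d\theta}\cos(m\theta) = -m\sin(m\theta) \le 0 \qquad \text{for} \quad 0 \le \theta \le \frac{\pi}{m}$$

since $0 \le m\theta \le \pi$ on that interval and $\sin$ is non-negative on $[0, \pi]$. The same observation gives the inequality used in the question: $m\theta \ge \theta$ with both angles in $[0, \pi]$, where $\cos$ is decreasing, so $\cos(m\theta) \le \cos(\theta)$.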

$D(\theta )$ is then required to be compatible with $\cos (m\theta )$: in the first paper you list, they note that $D(\pi /m) = \cos (\pi )$ (they actually write $D(\pi /m) = \cos (\pi /m)$, but this is clearly a typo given the definition of $\psi$). Since the original softmax function uses $\cos $ on $[0, \pi]$, and $\cos$ is monotonically decreasing on the whole interval, $D(\theta )$ is required to preserve that property.

$D(\theta )$ is needed because they have to state what the modified softmax function does on $\pi / m \leq \theta \leq \pi$. They intend to use $\theta $ across its whole range, so $\psi $ needs to be defined there. In the (first) paper they actually define $\psi (\theta )$, and hence $D(\theta )$, piecewise:

$$ \psi (\theta ) := (-1)^k \cos (m\theta) - 2k \quad \theta \in \left[\frac{k\pi}{m} , \frac{(k+1)\pi}{m} \right], \mbox{ and } k\in \{0, 1, \ldots , m-1 \} $$
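To see numerically that this piecewise definition behaves as described, here is a minimal sketch, assuming NumPy is available. The function name `psi`, the choice $m = 4$, and the grid size are illustrative assumptions, not taken from the paper. It evaluates the formula above on $[0, \pi]$ and checks that the result is non-increasing and never exceeds $\cos(\theta)$.

```python
import numpy as np

def psi(theta, m):
    """Evaluate (-1)^k * cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m].

    Hypothetical helper written for illustration; m is a positive integer.
    """
    theta = np.asarray(theta, dtype=float)
    # Index k of the sub-interval containing theta.
    # Clipping keeps theta = pi in the last interval, k = m - 1.
    k = np.clip(np.floor(theta * m / np.pi), 0, m - 1).astype(int)
    sign = np.where(k % 2 == 0, 1.0, -1.0)  # (-1)^k
    return sign * np.cos(m * theta) - 2.0 * k

m = 4  # assumed value, chosen only for the demonstration
theta = np.linspace(0.0, np.pi, 2001)
values = psi(theta, m)

# Monotonically decreasing: successive differences are never positive
# (a small tolerance absorbs floating-point noise near the flat points).
is_decreasing = bool(np.all(np.diff(values) <= 1e-12))

# The modified target logit is never larger than the original cos(theta).
below_cos = bool(np.all(values <= np.cos(theta) + 1e-12))

print(is_decreasing, below_cos)
```

The $-2k$ offset is what makes the pieces join up: at each breakpoint $\theta = k\pi/m$ the formula for piece $k-1$ and the formula for piece $k$ both give $1 - 2k$, so $\psi$ is continuous and runs from $1$ at $\theta = 0$ down to $1 - 2m$ at $\theta = \pi$.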