Is maximum entropy loss (maxent) convex?

58 Views Asked by At

Let $x^{(i)}\in \mathbb{R}^n$ and $y^{(i)}\in \mathbb{R}^c$ for $i=1,\dots, N$ and $\Theta \in \mathbb{R}^{c \times n}$ where $\theta_{\ell \bullet}$ is the $\ell$-th column of $\Theta$. Let $g: \mathbb{R}^n \times \mathbb{R}^{c \times n} \to \mathbb{R}^c$ be a vector-valued function where $$ g(x^{(i)}, \Theta)= [\frac{e^{\theta_{1 \bullet}^{\top}}x^{(i)}}{\sum_{j=1}^c e^{\theta_{j \bullet}^{\top}x^{(i)}}}, \dots, \frac{e^{\theta_{c \bullet}^{\top}}x^{(i)}}{\sum_{j=1}^c e^{\theta_{j \bullet}^{\top}x^{(i)}}} ]^{\top} $$

Let $f(x^{(i)}, \Theta, y^{(i)})= ||g(x^{(i)}, \Theta)-y^{(i)}||_2^2$ and $R_N(x^{(i)}, \Theta, y^{(i)})=\frac{1}{N}\sum_{i=1}^{N}f(x^{(i)}, \Theta, y^{(i)})$. Then, the best $\Theta$ that fits to our data is the following:

$$ \Theta^* = \arg\min_{\Theta \in \mathbb{R}^{c \times n}} R_N(x^{(i)}, \Theta, y^{(i)}) = \arg\min_{\Theta \in \mathbb{R}^{c \times n}} \frac{1}{N}\sum_{i=1}^{N}f(x^{(i)}, \Theta, y^{(i)}) $$

$\Theta^*$ gives the best classification loss. This model is also called maximum entropy or (maxent model) because $y^{(i)}$'s are one-hot vecotr, i.e., $y^{(i)}_{j=k}=1$ and $y^{(i)}_{j\neq k}=0$ for $k=1,\dots, c$ and one wants to maximize $g(x^{(i)}, \theta)_j=\frac{e^{\theta_{k \bullet}^{\top}}x^{(i)}}{\sum_{j=1}^c e^{\theta_{j \bullet}^{\top}x^{(i)}}}$ by chosing the best $\Theta$.

My question: Is $R_N(x^{(i)}, \Theta, y^{(i)})$ convex of $\Theta$?