$\newcommand{\bm}[1]{\mathbf{#1}}$Given a semi-orthogonal fat matrix ${\bm B} \in\mathbb R^{c \times d}$ (i.e., $c\leq d$, $\bm {BB}^\top=\bm I$), a matrix $\bm X \in \mathbb R^{m \times n}$, and $c$ one-hot vectors $\bm y_1, \bm y_2, \dots, \bm y_c \in \mathbb R^c$, let the cost function $J : \mathbb R^{m \times d} \to \mathbb R$ be defined by
$$J ({\bm W}) := -\frac1n\sum_{i=1}^n\frac1{1 + \frac1{c-1} \sum\limits_{\substack{1\leq j\leq c \\ \bm y_j\neq \bm y_i}} \exp \left((\bm y_j-\bm y_i)^\top \bm B\bm W^\top\bm x_i \right)}$$
Let $\bm X = \begin{bmatrix} \bm x_1 & \bm x_2 & \dots & \bm x_n \end{bmatrix}$; $n>c$. I have worked out the gradient $\nabla_{\bm W}J$ as:
$$\nabla_{\bm W}J=\frac1{n(c-1)}\bm X\left[\bm M-\mathrm {diag}(\bm M\bm e)\bm Y^\top\right]\bm B$$
where $\bm e = \bm 1^{c\times 1}$ is the all-ones vector. $\bm Y\in \{0, 1\}^{c\times n}$ is the fixed label matrix whose $i$th column is the one-hot vector $\bm y_i$ corresponding to $\bm x_i$, and $\bm M\in\mathbb R_+^{n\times c}$ is a matrix of scaled exponential elements such that $$M_{ij} = J_i^2\exp\left((\bm y_j-\bm y_i)^\top \bm B\bm W^\top\bm x_i\right)$$
where $J_i$ is the $i$th summation term in $J$.
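This gradient formula can be sanity-checked numerically. Below is a minimal sketch (assuming NumPy, small random data, and standard-basis one-hot labels; all variable names are mine) that implements $J$ and the matrix gradient expression and compares the latter against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, c, n = 5, 4, 3, 8          # n > c and c <= d, as required

# Semi-orthogonal B (c x d) with B @ B.T = I_c, via QR of a random matrix
B = np.linalg.qr(rng.standard_normal((d, c)))[0].T
X = rng.standard_normal((m, n))
labels = rng.integers(0, c, size=n)
Y = np.eye(c)[:, labels]          # c x n, column i is the one-hot label y_i of x_i
W = rng.standard_normal((m, d))

def cost(W):
    Z = (B @ W.T @ X).T                                # Z[i, j] = y_j^T B W^T x_i
    E = np.exp(Z - Z[np.arange(n), labels][:, None])   # exp((y_j - y_i)^T B W^T x_i)
    s = E.sum(axis=1) - 1.0                            # sum over j with y_j != y_i
    return -np.mean(1.0 / (1.0 + s / (c - 1)))

def grad(W):
    Z = (B @ W.T @ X).T
    E = np.exp(Z - Z[np.arange(n), labels][:, None])
    Ji = 1.0 / (1.0 + (E.sum(axis=1) - 1.0) / (c - 1))  # i-th summand of J
    M = (Ji ** 2)[:, None] * E                          # M[i, j] = J_i^2 exp(...)
    return X @ (M - np.diag(M @ np.ones(c)) @ Y.T) @ B / (n * (c - 1))

# Central finite differences, entry by entry
G, eps = grad(W), 1e-6
Gnum = np.zeros_like(W)
for a in range(m):
    for b in range(d):
        dW = np.zeros_like(W); dW[a, b] = eps
        Gnum[a, b] = (cost(W + dW) - cost(W - dW)) / (2 * eps)
print(np.max(np.abs(G - Gnum)))    # max abs discrepancy
```

On random instances the two agree to finite-difference precision, which supports the matrix form above.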
I am trying to understand the convergence properties of $J$. At first glance, it appears that $J$ is minimized when $\Vert\bm W\Vert\to\infty$ and $(\bm y_j-\bm y_i)^\top \bm B\bm W^\top\bm x_i<0,\ \forall i,j$. However, when I look at the gradient, at convergence we have: $$\bm X\left[\bm M-\mathrm {diag}(\bm M\bm e)\bm Y^\top\right]\bm B=\bm 0$$
I understand that $$(\Vert\bm W\Vert\to\infty;\ (\bm y_j-\bm y_i)^\top \bm B\bm W^\top\bm x_i<0,\ \forall i,j)\implies\nabla_{\bm W}J\to\bm 0\implies\bm M\to\bm Y^\top\implies J\to-1,$$ but from this equation it would appear that there might be local minima in the function, since, in general, $\bm X\bm A\bm B=\bm 0$ does not require that $\bm A=\bm 0$.
If so, given that $\bm X$ and $\bm B$ are fixed, can we somehow find the conditions on $\bm M$ that result in convergence?
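The implication chain above ($\Vert\bm W\Vert\to\infty\implies\bm M\to\bm Y^\top\implies J\to-1$) can be illustrated numerically: construct a $\bm W_0$ for which every cross-class exponent is negative, then scale it up. A sketch assuming NumPy and $m\ge n$ so that the scores can be fit exactly (all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, c, n = 10, 6, 3, 8         # m >= n so the target scores below are attainable
B = np.linalg.qr(rng.standard_normal((d, c)))[0].T   # B @ B.T = I_c
X = rng.standard_normal((m, n))
labels = rng.integers(0, c, size=n)
Y = np.eye(c)[:, labels]

# Choose W0 with B W0^T x_i = 10 y_i, so every cross-class exponent
# (y_j - y_i)^T B W0^T x_i equals -10 < 0.  (pinv(X) X = I_n since rank(X) = n.)
W0 = (B.T @ (10.0 * Y) @ np.linalg.pinv(X)).T

def J_and_M(W):
    Z = (B @ W.T @ X).T                                # Z[i, j] = y_j^T B W^T x_i
    E = np.exp(Z - Z[np.arange(n), labels][:, None])
    Ji = 1.0 / (1.0 + (E.sum(axis=1) - 1.0) / (c - 1))
    return -Ji.mean(), (Ji ** 2)[:, None] * E          # (J, M)

for t in (1.0, 2.0, 5.0):
    J, M = J_and_M(t * W0)
    print(f"t={t}: J={J:.6f}, max|M - Y^T|={np.max(np.abs(M - Y.T)):.2e}")
```

As $t$ grows, $J$ approaches $-1$ and $\bm M$ approaches $\bm Y^\top$, consistent with the chain above.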
Update:
Noting that $\bm {BB}^\top=\bm I$ simplifies the convergence condition $\nabla_{\bm W}J=\bm 0$ to:
$$\bm X\bm M=\bm X\mathrm {diag}(\bm M\bm e)\bm Y^\top$$ $$\implies\bm M^\top\bm X^\top=\bm Y\mathrm {diag}(\bm M\bm e)\bm X^\top$$
This tells me that $\bm M^\top$ acts as a transformation that projects $\bm X^\top$ onto $\bm Y$, scaled by $\mathrm {diag}(\bm M\bm e)$. If we let $\Vert\bm W\Vert\to\infty$ during minimization, then $\bm M\to\bm Y^\top\implies\mathrm {diag}(\bm M\bm e)\to \bm I_n$. I guess my question is then equivalent to the following:
"Does there exist a diagonal matrix with distinct elements, i.e., $\mathrm{diag}(\bm M\bm e)\neq k\bm I^n$ for some finite $\Vert\bm W\Vert$ such that the above expression holds?"
Calculation of $\nabla_{\bm W}J$
There seems to be some confusion in the comments regarding the expression I am getting for the gradient. Here is what I am doing:
$$\nabla_{\bm W}J=\bm X(\nabla_{\bm W^\top\bm X}J)^\top$$
where $\nabla_{\bm W^\top\bm X}J=\begin{bmatrix}\nabla_{\bm W^\top\bm x_1}J & \nabla_{\bm W^\top\bm x_2}J & \dots & \nabla_{\bm W^\top\bm x_n}J \end{bmatrix}$. Now, $\nabla_{\bm W^\top\bm x_i}J$ can be calculated as: $$\nabla_{\bm W^\top\bm x_i}J=\frac{J_i^2}{n(c-1)}\bm B^\top\sum\limits_{\bm y_i\neq\bm y_j}\exp\left((\bm y_j-\bm y_i)^\top \bm B\bm W^\top\bm x_i\right)(\bm y_j-\bm y_i) \\ =\frac{J_i^2}{n(c-1)}\bm B^\top\sum\limits_{\forall j}\exp\left((\bm y_j-\bm y_i)^\top \bm B\bm W^\top\bm x_i\right)(\bm y_j-\bm y_i) $$ (the second equality holds because the term with $\bm y_j=\bm y_i$ contributes $\bm y_j-\bm y_i=\bm 0$).
Let $m_j=J_i^2\exp\left((\bm y_j-\bm y_i)^\top \bm B\bm W^\top\bm x_i\right)$, then $$\nabla_{\bm W^\top\bm x_i}J=\frac1{n(c-1)}\bm B^\top\sum\limits_{\forall j}m_j(\bm y_j-\bm y_i) \\ =\frac1{n(c-1)}\bm B^\top\left[\sum\limits_{\forall j}m_j\bm y_j-\bm y_i\sum\limits_{\forall j}m_j\right] $$
Since $\bm y_j$ is one-hot, the term in brackets can be rewritten in vector form: $$\nabla_{\bm W^\top\bm x_i}J=\frac1{n(c-1)}\bm B^\top\left[\bm m_i-\left(\bm m_i^\top\bm e\right)\bm y_i\right]$$
where $\bm m_i=\begin{bmatrix} m_1 & m_2 & \dots & m_c\end{bmatrix}^\top$. From here, one can work upwards to find $\nabla_{\bm W}J$ as provided previously. Notably, $\bm m_i^\top$ also turns out to be the $i$th row of $\bm M$ mentioned previously. Hopefully, I did not miss anything.
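For what it's worth, this assembly can also be checked numerically: build each column $\nabla_{\bm W^\top\bm x_i}J$ from $\bm m_i$ as above, stack them, and compare with the matrix expression. A sketch assuming NumPy and random data (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, c, n = 5, 4, 3, 8
B = np.linalg.qr(rng.standard_normal((d, c)))[0].T   # B @ B.T = I_c
X = rng.standard_normal((m, n))
labels = rng.integers(0, c, size=n)
Y = np.eye(c)[:, labels]
W = rng.standard_normal((m, d))

Z = (B @ W.T @ X).T                                  # Z[i, j] = y_j^T B W^T x_i
E = np.exp(Z - Z[np.arange(n), labels][:, None])
Ji = 1.0 / (1.0 + (E.sum(axis=1) - 1.0) / (c - 1))
M = (Ji ** 2)[:, None] * E                           # row i of M is m_i^T

# Column i of grad_{W^T X} J is  B^T [m_i - (m_i^T e) y_i] / (n (c-1))
cols = [B.T @ (M[i] - M[i].sum() * Y[:, i]) / (n * (c - 1)) for i in range(n)]
G_assembled = X @ np.stack(cols, axis=1).T           # X (grad_{W^T X} J)^T

# Matrix form from the top of the post
G_matrix = X @ (M - np.diag(M @ np.ones(c)) @ Y.T) @ B / (n * (c - 1))
print(np.allclose(G_assembled, G_matrix))            # True
```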
$ \def\R#1{{\mathbb R}^{#1}} \def\h{\odot} \def\o{{\tt1}} \def\e{{\varepsilon}} \def\bR#1{\big[#1\big]} \def\BR#1{\Big[#1\Big]} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\diag#1{\op{diag}\LR{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\tfracLR#1#2{\LR{\tfrac{#1}{#2}}} \def\HH{H^{-2}} \def\H{H^{-1}} $This is too big for a comment, but here is how I calculated the gradient. In addition to the Hadamard product $(\h)$ I'll use the Frobenius product $(:)$ $$\eqalign{ &A:B = \sum_{i=1}^m\sum_{j=1}^n \LR{A\h B}_{ij} \;=\; \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \\ &B:B = \frob{B}^2 \qquad \{ {\rm Frobenius\;norm} \}\\ &A:B = B:A \;=\; B^T:A^T \;=\; \trace{A^TB} \\ &\LR{XY}:B = X:\LR{BY^T} \;=\; Y:\LR{X^TB} \\ &C:\LR{A\h B} = \LR{C\h A}:B \\ }$$ Let $\{\e_k\in\R n\}$ denote the standard basis vectors and note that $$\eqalign{ x_k = X\e_k,\qquad y_k = Y\e_k \\ }$$ The following variables are useful $$\eqalign{ I &= \sum_k \e_k\e_k^T \qquad{\;\;\rm and}\quad\;\; e = \sum_k \e_k \\ P &= X^TWB^TY \qiq p = \diag P \;=\; \sum_k \e_k\LR{\e_k^TP\e_k} \\ G &= \sum_j\sum_k \e_k \BR{\e_k^TP\LR{\e_j-\e_k}} \e_j^T \;\equiv\; \LR{P - pe^T} \\ A &= \exp(G) \qiq dA = A\odot dG \\ H &= (c-2)I + \Diag{Ae} \;=\; H^T \\ M &= \tfracLR{c-1}{n}\BR{\diag\HH\:e^T}\h A \\ }$$ where the
$\exp()$ function is applied elementwise. Rewrite the cost function using the above notation, and calculate its gradient $$\eqalign{ J &= \tfracLR{1-c}{n}I:\H \\ \\ dJ &= \tfracLR{c-1}{n}I:\H\,dH\,\H \\ &= \tfracLR{c-1}{n}\HH:dH \\ &= \tfracLR{c-1}{n}\HH:\Diag{dA\;e} \\ &= \tfracLR{c-1}{n}\diag\HH\:e^T:\c{dA} \\ &= \tfracLR{c-1}{n}\diag\HH\:e^T:\CLR{A\odot dG} \\ &\equiv M:dG \\ &= M:\LR{dP-\c{dp}\,e^T} \\ &= M:dP - Me:\c{\diag{dP}} \\ &= \BR{M-\Diag{Me}}:dP \\ &= \BR{M-\Diag{Me}}:\LR{X^T\,dW\:B^TY} \\ &= X\BR{M-\Diag{Me}}Y^TB:dW \\ \\ \grad{J}{W} &= X\BR{M-\Diag{Me}}Y^TB \\ }$$
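As a sanity check, the variables above can be implemented directly and the stated gradient compared against finite differences of $J=\tfrac{1-c}{n}I:H^{-1}$. A minimal NumPy sketch of this derivation as written, i.e., with $\e_k\in\mathbb R^n$ so the sums run over the $n$ columns of $Y$ (all variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, c, n = 5, 4, 3, 8
B = np.linalg.qr(rng.standard_normal((d, c)))[0].T   # B @ B.T = I_c
X = rng.standard_normal((m, n))
Y = np.eye(c)[:, rng.integers(0, c, size=n)]         # columns y_k = Y e_k
e = np.ones(n)

def J_of(W):
    P = X.T @ W @ B.T @ Y                 # P = X^T W B^T Y  (n x n)
    G = P - np.diag(P)[:, None]           # G = P - p e^T
    A = np.exp(G)                         # elementwise exp
    h = (c - 2) + A @ e                   # diagonal of H = (c-2) I + Diag(A e)
    return ((1 - c) / n) * np.sum(1.0 / h)

def grad_of(W):
    P = X.T @ W @ B.T @ Y
    G = P - np.diag(P)[:, None]
    A = np.exp(G)
    h = (c - 2) + A @ e
    M = ((c - 1) / n) * (1.0 / h**2)[:, None] * A    # [diag(H^-2) e^T] o A, scaled
    return X @ (M - np.diag(M @ e)) @ Y.T @ B        # X [M - Diag(Me)] Y^T B

# Compare against central finite differences
W = rng.standard_normal((m, d))
Gan, eps = grad_of(W), 1e-6
Gnum = np.zeros_like(W)
for a in range(m):
    for b in range(d):
        dW = np.zeros_like(W); dW[a, b] = eps
        Gnum[a, b] = (J_of(W + dW) - J_of(W - dW)) / (2 * eps)
print(np.max(np.abs(Gan - Gnum)))        # max abs discrepancy
```

The two agree to finite-difference precision, confirming that the gradient is consistent with this formulation of $J$.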