I'm curious about the relationship between the Cross-Entropy function and the sum-log-loss function when doing multi-class logistic regression applied to neural networks.
A setup would be as follows:
Suppose $\Omega=\{1,2,\dots,K\}$ with $K\geq2$, and we have training data $(x, y)$ where $x\in\mathbb{R}^{(k+1)\times n}$ ($k$ features plus a bias coordinate, $n$ examples) and target $y\in\Omega^n$. Using the identification $i\mapsto e_i$, where $\{e_i\}$ is the standard basis of $\mathbb{R}^K$, we regard each label in $\Omega$ as a one-hot basis vector, that is, $$y_j^i=\begin{cases} 1&\text{if } y_j=i,\\ 0&\text{otherwise.} \end{cases}$$ Then we have the loss function $$J(\theta)=-\frac{1}{n}\sum_{j=1}^n\sum_{i=1}^K\left(y^i_j\log(\hat{y}^i_j)+(1-y^i_j)\log(1-\hat{y}^i_j)\right),$$ where $\hat{y}$ is our predicted $y$-value given by $$\hat{y}^i_j(\theta)=\left(1+e^{-\langle \theta^i, x_j\rangle}\right)^{-1},$$ and $\theta\in\mathbb{R}^{K\times(k+1)}$ is our weight matrix (with rows $\theta^i$) found by minimizing $J$.
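To keep myself honest about the indices, here is a small NumPy sketch of $J$ as I understand it (per-class sigmoid outputs, sum of $K$ binary log-losses, averaged over the $n$ examples); the shapes and names are my own, not part of any standard setup:

```python
import numpy as np

rng = np.random.default_rng(0)
K, k, n = 3, 4, 5                       # classes, features, examples

theta = rng.normal(size=(K, k + 1))     # one row theta^i per class
# Stack a row of ones under the features so <theta^i, x_j> includes a bias term.
x = np.vstack([rng.normal(size=(k, n)), np.ones((1, n))])
y = np.eye(K)[:, rng.integers(K, size=n)]   # one-hot targets, shape (K, n)

# Per-class sigmoid predictions: y_hat[i, j] = sigma(<theta^i, x_j>)
y_hat = 1.0 / (1.0 + np.exp(-theta @ x))

# J: average over examples of the sum over classes of binary log-losses
J = -np.mean(np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat), axis=0))
```

Note that nothing here forces the columns of `y_hat` to sum to 1, which already hints at why this is not literally the cross-entropy between two distributions.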
We also have the Cross-Entropy function given by \begin{align} \tilde{J}(\theta)&=-\frac{1}{n}\sum_{j=1}^n\sum_{i=1}^Kp^i_j\log(q^i_j(\theta))\\ &=-\frac{1}{n}(p:\log(q(\theta))), \end{align} where $:$ denotes the Frobenius inner product on $K\times n$-matrices.
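For comparison, here is a sketch of $\tilde{J}$ under what I suspect is the intended reading, namely $p=y$ (the one-hot targets) and $q=\operatorname{softmax}(\theta x)$ column-wise, so each $q_j$ is a genuine probability distribution over the $K$ classes; this choice of $p$ and $q$ is my guess, which is exactly what the question below asks about:

```python
import numpy as np

rng = np.random.default_rng(0)
K, k, n = 3, 4, 5                       # classes, features, examples

theta = rng.normal(size=(K, k + 1))
x = np.vstack([rng.normal(size=(k, n)), np.ones((1, n))])  # features + bias row
y = np.eye(K)[:, rng.integers(K, size=n)]   # one-hot targets, playing the role of p

# Guessed choice: q_j = softmax of the logits theta @ x, column by column,
# so every column of q sums to 1 (a distribution over the K classes).
logits = theta @ x
q = np.exp(logits - logits.max(axis=0))     # subtract column max for stability
q /= q.sum(axis=0)

# tilde_J = -(1/n) * (p : log q), the Frobenius inner product form
tilde_J = -np.sum(y * np.log(q)) / n
```

With this choice the $(1-y^i_j)\log(1-\cdot)$ terms never appear, which is the discrepancy with $J$ that prompts the question.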
This leads me to my question. I would prefer to use $\tilde{J}$, since it makes computing derivatives fairly trivial, but I'm not seeing how $\tilde{J}=J$. How do we define $p$ and $q$ explicitly in terms of $y$ and $\hat{y}$ (and $\theta$) so that $J=\tilde{J}$?
I come from a background in differential geometry, and just started dabbling in machine learning as a hobby. Unfortunately, my probability theory is a bit lacking (almost exclusively examples in a real analysis course nearly 10 years ago), so if anyone has references for this subject that would align with my background, they would also be greatly appreciated.