Rewriting channel transition probability


I have a channel transition probability for the Mixture-of-Experts (MoE) model in machine learning: \begin{align*} &\mathbb{P}\Big[y_i\Big|\langle X_i,\beta^{(1)}\rangle,\dots,\langle X_i,\beta^{(L)}\rangle,\langle X_i,w^{(1)}\rangle,\dots,\langle X_i,w^{(L)}\rangle\Big] \\ &\qquad=\sum_{l=1}^L \frac{\exp(\langle X_i,w^{(l)}\rangle)}{\sum_{l'=1}^L\exp(\langle X_i,w^{(l')}\rangle)}\mathcal{N}\Big(y_i\Big|g(\langle X_i,\beta^{(l)}\rangle),\sigma^2\Big), \end{align*} where $\mathcal{N}(y_i|g(\langle X_i,\beta^{(l)}\rangle),\sigma^2)$ denotes the probability density function of a Gaussian random variable with mean $g(\langle X_i,\beta^{(l)}\rangle)$ and variance $\sigma^2$, $g(\cdot)$ is some non-linear activation function, and $\langle\cdot,\cdot\rangle$ denotes the inner product between vectors. So the output $y_i$ essentially chooses one of the $g(\langle X_i,\beta^{(l)}\rangle)$ with probability $\frac{\exp(\langle X_i,w^{(l)}\rangle)}{\sum_{l'=1}^L\exp(\langle X_i,w^{(l')}\rangle)}$ (a categorical distribution with softmax weights) and adds Gaussian noise $\mathcal{N}(0,\sigma^2)$ to it.
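For concreteness, the sampling procedure just described can be sketched in NumPy as follows; here $g=\tanh$ is an arbitrary choice of activation, and the function names are my own:

```python
import numpy as np

def gating(X_i, ws):
    """Softmax gating probabilities exp(<X_i,w^(l)>) / sum_l' exp(<X_i,w^(l')>)."""
    logits = np.array([X_i @ w for w in ws])
    e = np.exp(logits - logits.max())  # shift logits for numerical stability
    return e / e.sum()

def sample_y(X_i, betas, ws, sigma, g=np.tanh, rng=None):
    """Draw y_i: pick expert l from the gating categorical, then add N(0, sigma^2) noise."""
    if rng is None:
        rng = np.random.default_rng()
    p = gating(X_i, ws)
    l = rng.choice(len(ws), p=p)           # categorical expert choice
    return g(X_i @ betas[l]) + sigma * rng.normal()
```

Note that `sample_y` already uses two auxiliary random draws (one categorical, one Gaussian), which is exactly what the $\Psi_i$ below is meant to capture in closed form.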

I am wondering whether we can rewrite the channel transition probability in the form \begin{align*} y_i=q\Big(\langle X_i,\beta^{(1)}\rangle,\dots,\langle X_i,\beta^{(L)}\rangle,\langle X_i,w^{(1)}\rangle,\dots,\langle X_i,w^{(L)}\rangle,\Psi_i\Big), \end{align*} where $\Psi_i$ represents a vector of auxiliary variables (of our own choosing), i.e., they give the function $q$ more freedom. Essentially, I am asking for an explicit form of $q(\cdot)$ -- more details in my attempt below, which should make this clearer. The replies to a previous post seem to indicate that this is possible.

My attempt: I wasn't able to figure this out for the problem above, but I managed to figure it out for a similar but simpler problem, which can shed some light on what I am looking for. For the logistic regression model where $y_i\in\{0,1\}$, we have: \begin{align*} \mathbb{P}\Big[y_i=1\Big|\langle X_i,\beta^{(1)}\rangle\Big] &=\frac{\exp(\langle X_i,\beta^{(1)}\rangle)}{1+\exp(\langle X_i,\beta^{(1)}\rangle)}. \end{align*} Then we have the equivalent representation \begin{align*} y_i &=q(\langle X_i,\beta^{(1)}\rangle,\Psi_i) =\boldsymbol{1}\bigg[\Psi_i\leq \frac{\exp(\langle X_i,\beta^{(1)}\rangle)}{1+\exp(\langle X_i,\beta^{(1)}\rangle)}\bigg], \end{align*} where in this specific case $\Psi_i$ is a single element with distribution $\Psi_i\sim\text{Uniform}[0,1]$, and $\boldsymbol{1}[\cdot]$ is the indicator function. I am looking for this type of representation for the MoE problem introduced above. I feel like a similar idea with the auxiliary uniform random variable might also work for the MoE problem, but I can't seem to figure it out... Though I emphasize that we can use any other auxiliary variables that we deem suitable (they don't have to be uniform random variables).
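To illustrate, the logistic representation above can be checked empirically; this is just a sketch of the inverse-CDF idea, with the function name `q_logistic` being my own:

```python
import numpy as np

def q_logistic(z, psi):
    """y = 1[psi <= exp(z)/(1+exp(z))] with psi ~ Uniform[0,1]
    reproduces P[y = 1 | z] = exp(z)/(1+exp(z))."""
    p = np.exp(z) / (1.0 + np.exp(z))      # logistic (sigmoid) probability
    return (np.asarray(psi) <= p).astype(int)

# Empirical check: the frequency of y = 1 over many uniform draws of psi
# should be close to exp(0.7)/(1+exp(0.7)) ~ 0.668.
rng = np.random.default_rng(1)
psi = rng.uniform(size=200_000)
freq = q_logistic(0.7, psi).mean()
```

The same trick extends naturally to the categorical part of the MoE channel: partition $[0,1]$ into intervals of lengths equal to the softmax gating probabilities and use a uniform variable to select the interval, which is how I would expect one coordinate of $\Psi_i$ to enter $q(\cdot)$.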