Multivariate Gaussian distribution over error function resulting in Arcsin?


TL;DR: what is the relation between the arcsin function, Gaussian distributions, and the erf function?

In a 1995 paper by Saad and Solla (https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.74.4337), a neural network's error is averaged over all possible input vectors in order to obtain an expression for the network's generalization error. In the course of this derivation, the erf function (the activation function of the neurons) is averaged over a Gaussian distribution, and the result somehow contains the arcsine function. I am trying to understand how this is arrived at.

The situation is as follows:

So called 'activations' are defined as

$x_i = \boldsymbol{J}_i\cdot \boldsymbol\xi, \qquad y_n = \boldsymbol{B}_n\cdot \boldsymbol\xi$

where the $\boldsymbol J_i$ are the weight vectors of the network being trained, and the $\boldsymbol{B}_n$ are those of the 'teacher' network (which is fixed).

With $\boldsymbol J = \{ \boldsymbol J_i\}_{1\leq i \leq K}$, the training error is defined as

$\epsilon(\boldsymbol{J}, \boldsymbol{\xi}) = \frac{1}{2}\left[ \sum_{i=1}^K \operatorname {erf}\left(\frac{x_i}{\sqrt 2}\right) - \sum_{n=1}^M \operatorname {erf}\left(\frac{y_n}{\sqrt 2}\right)\right]^2$

(the activation function in the paper is $g(x) = \operatorname{erf}(x/\sqrt 2)$).

The components of $\boldsymbol{\xi}$ are uncorrelated random variables with zero mean and unit variance. Therefore, averaging over all possible inputs to obtain the generalization error is performed implicitly through averages over the activations $\boldsymbol{x} = (x_1, ..., x_K), \boldsymbol y = (y_1, ..., y_M)$

The covariance matrix $C$ is then given by $C = \begin{bmatrix} Q & R \\ R^T & T \end{bmatrix}$

with

$Q_{ik} = \langle x_ix_k\rangle = \boldsymbol{J}_i \cdot \boldsymbol{J}_k$,

$R_{in} = \langle x_iy_n\rangle = \boldsymbol{J}_i \cdot \boldsymbol{B}_n$ and

$T_{nm} = \langle y_ny_m\rangle = \boldsymbol{B}_n \cdot \boldsymbol{B}_m$.
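To convince myself of this reduction, I checked numerically that the activations $\boldsymbol x = J\boldsymbol\xi$, $\boldsymbol y = B\boldsymbol\xi$ really do have the overlaps $Q$, $R$, $T$ as their covariances (a minimal sketch of my own; the sizes and weights are arbitrary, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 50, 3, 2                          # input dimension, student/teacher sizes
J = rng.normal(size=(K, N)) / np.sqrt(N)    # student weight vectors J_i
B = rng.normal(size=(M, N)) / np.sqrt(N)    # teacher weight vectors B_n

# Many inputs xi with i.i.d. zero-mean, unit-variance components
n_samples = 200_000
xi = rng.normal(size=(N, n_samples))
x = J @ xi                                  # activations x_i = J_i . xi
y = B @ xi                                  # activations y_n = B_n . xi

# Empirical covariances vs. the claimed overlaps Q, R, T
Q_emp = (x @ x.T) / n_samples
R_emp = (x @ y.T) / n_samples
T_emp = (y @ y.T) / n_samples
print(np.max(np.abs(Q_emp - J @ J.T)))      # small (Monte Carlo error)
print(np.max(np.abs(R_emp - J @ B.T)))
print(np.max(np.abs(T_emp - B @ B.T)))
```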

Then the averages over $\boldsymbol{x}$ and $\boldsymbol{y}$ are performed using the multivariate Gaussian distribution:

$$P(\boldsymbol{x}, \boldsymbol{y}) = \frac{1}{\sqrt{(2\pi)^{K+M} |C|}}\exp\left\{-\frac{1}{2} (\boldsymbol x, \boldsymbol y)\, C^{-1} (\boldsymbol x, \boldsymbol y)^T\right\}$$

and averaging the training error over this distribution. Then, somehow, the authors arrive at:

$$ \epsilon_g(\boldsymbol{J}) = \frac{1}{\pi} \left\{\sum_{ik} \arcsin \frac{Q_{ik}}{\sqrt{1 + Q_{ii}} \sqrt{1 + Q_{kk}}} +\sum_{nm} \arcsin \frac{T_{nm}}{\sqrt{1 + T_{nn}} \sqrt{1 + T_{mm}}} -2\sum_{in} \arcsin \frac{R_{in}}{\sqrt{1 + Q_{ii}} \sqrt{1 + T_{nn}}} \right\} $$

I suspect the three separate terms in this expression correspond to expanding the square in the definition of the training error, but that's about as far as I get: I split the average into three averages, each over a different block of $(\boldsymbol x, \boldsymbol y)$ space, but I fundamentally don't understand the connection between the Gaussian distribution, the erf function, and the arcsin function.
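Working backwards from the final expression, the building block appears to be the pairwise identity $\left\langle \operatorname{erf}\!\left(\tfrac{u}{\sqrt 2}\right)\operatorname{erf}\!\left(\tfrac{v}{\sqrt 2}\right)\right\rangle = \frac{2}{\pi}\arcsin \frac{c_{12}}{\sqrt{(1+c_{11})(1+c_{22})}}$ for zero-mean jointly Gaussian $(u, v)$ with covariance entries $c_{ab}$ (my notation, not the paper's). A quick Monte Carlo check, my own sketch, is at least consistent with this:

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(1)
# Covariance of a pair (u, v) of zero-mean Gaussians (values arbitrary)
c11, c22, c12 = 1.3, 0.8, 0.5
C = np.array([[c11, c12],
              [c12, c22]])
u, v = rng.multivariate_normal([0.0, 0.0], C, size=2_000_000).T

# Monte Carlo estimate of <erf(u/sqrt(2)) erf(v/sqrt(2))>
mc = np.mean(erf(u / np.sqrt(2)) * erf(v / np.sqrt(2)))
# Conjectured closed form
closed = (2 / np.pi) * np.arcsin(c12 / np.sqrt((1 + c11) * (1 + c22)))
print(mc, closed)   # agree up to Monte Carlo error
```

With this identity, each of the three terms above would follow by applying it to the pairs $(x_i, x_k)$, $(y_n, y_m)$ and $(x_i, y_n)$, whose covariance entries are exactly $Q$, $T$ and $R$. What I am missing is how the Gaussian average itself produces the arcsine.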

Could anyone give any pointers as to how this works? Thanks in advance!