How to interpret E in this formula?


I am trying to understand the meaning of $\mathbb{E}$ in this formula, taken from page 3 of https://arxiv.org/pdf/1710.09412.pdf. The paper is about vicinal risk minimization, and $\lambda$ is a value drawn from a beta distribution.

\begin{equation} \mu\left(\tilde{x}, \tilde{y} \mid x_{i}, y_{i}\right)=\frac{1}{n} \sum_{j}^{n} \underset{\lambda}{\mathbb{E}}\left[\delta\left(\tilde{x}=\lambda \cdot x_{i}+(1-\lambda) \cdot x_{j}, \tilde{y}=\lambda \cdot y_{i}+(1-\lambda) \cdot y_{j}\right)\right] \end{equation}

Does E represent vicinal distribution or is it expectation?

BEST ANSWER

The expectation is with respect to $\lambda \sim \text{Beta}(\alpha, \alpha)$. If it helps, think of $\lambda$ as a random variable $Z$. Written out explicitly, it is:

$$\mu=\frac{1}{n}\sum_{j=1}^n \int_0^1 \delta(...)\, \frac{\lambda^{\alpha-1}(1-\lambda)^{\alpha-1}}{B(\alpha, \alpha)}\,d\lambda$$
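As a quick numerical sanity check (not from the paper), the expectation over $\lambda$ can be computed either by Monte Carlo sampling from $\text{Beta}(\alpha, \alpha)$ or by integrating against the beta density; the two agree. Here `alpha` and the test function `g` (standing in for the $\delta(...)$ term) are arbitrary choices for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

alpha = 2.0                      # arbitrary Beta shape parameter for the check
g = lambda lam: lam ** 2         # test function standing in for the delta term

# Monte Carlo estimate of E_lambda[g(lambda)], lambda ~ Beta(alpha, alpha)
rng = np.random.default_rng(0)
mc = g(rng.beta(alpha, alpha, size=200_000)).mean()

# The same expectation written as an integral against the Beta(alpha, alpha) density
integral, _ = quad(lambda lam: g(lam) * beta.pdf(lam, alpha, alpha), 0.0, 1.0)

# For Beta(2, 2): E[lambda^2] = Var + E[lambda]^2 = 1/20 + 1/4 = 0.3
```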

The resulting vicinal distribution $\mu(\tilde x, \tilde y \mid x_i, y_i)$ allows for linear interpolation between classes: if two data points belong to different classes, the probability of class 0 versus class 1 should vary linearly along the segment between them. Points can be simulated from this distribution by randomly picking $m$ points $(x_i, y_i)$, $i = 1, 2, \ldots, m$, from the observed data and then generating from the conditional distribution $\mu$. The simulated points will no longer be exactly $(x_i, 0)$ or $(x_i, 1)$ but will instead lie in a vicinity with respect to both $x_i$ and $y_i$; this differs from the Chapelle et al. article, which augmented the data with Gaussian noise only.

Their motivation was stated at the beginning:

On the other hand, neural networks trained with ERM change their predictions drastically when evaluated on examples just outside the training distribution (Szegedy et al., 2014), also known as adversarial examples. This evidence suggests that ERM is unable to explain or provide generalization on testing distributions that differ only slightly from the training data. ... Therefore, mixup extends the training distribution by incorporating the prior knowledge that linear interpolations of feature vectors should lead to linear interpolations of the associated targets.

[Figure from the paper: ERM (left) vs. mixup (right) decision surfaces on a two-class toy dataset]

In the left panel, orange points are class 1 and green points are class 0; mixup (right) is what they are proposing. The intensity of blue indicates the probability of being class 1.


The only random variable is $\lambda$, and $\mathbb{E}$ means we're taking the expected value over that random variable.

The entire thing is the vicinal distribution.

Basically, they're saying that the standard method of training treats your input data as a discrete uniform distribution over your observed inputs. Their modification is for linear combinations of inputs (and the associated outputs) to also be included in the training data.
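In practice this amounts to mixing each training minibatch with a shuffled copy of itself before computing the loss. The sketch below is a minimal illustration under my own assumptions (toy data, `alpha` value, one shared $\lambda$ per batch, which is a common choice in reference implementations), not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.2                          # hypothetical Beta shape parameter

# A minibatch of inputs and one-hot targets, as plain ERM would see it
xb = rng.normal(size=(8, 3))
yb = np.eye(2)[rng.integers(0, 2, size=8)]

# mixup: pair each example with a shuffled partner and take the same
# convex combination of both inputs and targets
perm = rng.permutation(len(xb))
lam = float(rng.beta(alpha, alpha))
xb_mix = lam * xb + (1 - lam) * xb[perm]
yb_mix = lam * yb + (1 - lam) * yb[perm]
# (xb_mix, yb_mix) then replace (xb, yb) in the usual loss computation
```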