Stochastic Neighbor Embedding (SNE) - How to Understand the Cost Function of the Kullback Leibler (KL) Divergence


In the paper 'Stochastic Neighbor Embedding', the cost function, stated in terms of the Kullback–Leibler (KL) divergence, is

$$ C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} = \sum_i \operatorname{KL}(P_i \| Q_i). $$

Question 1: Why does it say that SNE is an improvement over methods like LLE or SOM, in which widely separated data-points can be 'collapsed' into near neighbors?

To minimize the cost, couldn't the optimization always place points at a small distance in the low-dimensional space, giving a large $q_{ij}$, which would also cause a 'collapse'? Or does 'collapse' mean something else?

Question 2: Similarly, I do not understand how SNE keeps the images of widely separated objects relatively far apart.

1 Answer (accepted)

Question 1

The paper discusses this in the following quote:

Notice that making $q_{ij}$ large when $p_{ij}$ is small wastes some of the probability mass in the $q$ distribution so there is a cost for modeling a big distance in the high-dimensional space with a small distance in the low-dimensional space, though it is less than the cost of modeling a small distance with a big one. In this respect, SNE is an improvement over methods like LLE [4] or SOM [5] in which widely separated data-points can be “collapsed” as near neighbors in the low-dimensional space. The intuition is that while SNE emphasizes local distances, its cost function cleanly enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.

Ideally, to minimize the KL divergence, we should have $q_{ij} \approx p_{ij}$. Their point is that the cost of a mismatch is asymmetric: overestimating $q_{ij}$ is cheaper than underestimating it. Note that $p$ is inversely related to distance in the high-dimensional (HD) space, while $q$ is inversely related to distance in the low-dimensional (LD) embedding space. A large $q_{ij}$ paired with a small $p_{ij}$ means modelling a large HD distance with a small LD distance. The penalty for this is borne out in the *other* terms of the sum: because $q$ is normalized, the probability mass "wasted" on the pair $(i,j)$ leaves the other $q_{k\ell}$ too small. On the other hand, modelling a small HD distance with a large LD distance means a large $p_{ij}$ with a small $q_{ij}$, so the $(i,j)$ term itself becomes heavily penalized, since it is weighted by the large $p_{ij}$. Their suggestion is that LLE and SOM lack this asymmetry, which makes such collapsing more likely in those methods.
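This asymmetry is easy to check numerically. Below is a minimal sketch (the neighbor probabilities are invented for illustration, not taken from the paper) comparing the two mismatch directions for a single point with a two-neighbor distribution:

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) = sum_j p_j * log(p_j / q_j)."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical HD neighbor distribution of a single point i:
# one close neighbor (p = 0.9) and one distant neighbor (p = 0.1).
p = np.array([0.9, 0.1])

# Case A: the distant neighbor gets too much q-mass, i.e. a large
# HD distance is modelled by a small LD distance.
kl_a = kl(p, np.array([0.6, 0.4]))

# Case B: the close neighbor gets too little q-mass, i.e. a small
# HD distance is modelled by a large LD distance.
kl_b = kl(p, np.array([0.4, 0.6]))

print(kl_a, kl_b)
```

Case B comes out markedly more expensive than Case A (roughly 0.55 vs. 0.23 nats here), matching the paper's claim that collapsing a distant pair is penalized, but less than separating a close pair.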

Question 2

Essentially reiterating the above: to keep the images of widely separated objects relatively far apart, we want $q_{ij}$ to be small when $p_{ij}$ is small.

If the HD distance is large, then $p_{ij}$ will be small, which makes it seem as if that term doesn't matter much, since it is $p_{ij}\log(p_{ij}/q_{ij})$. However, if the corresponding LD distance is small (meaning we failed to keep the points far apart), then $q_{ij}$ will be large. Due to the normalization of $q$, this means we have "wasted" some of its probability mass on that pair, so the other terms become larger as a result. Hence the optimization is driven to avoid this.
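The "wasted mass" effect can be made concrete with SNE-style affinities. In this sketch (all distances and probabilities are invented for illustration), $q$ is built from LD distances via a normalized Gaussian kernel, and an embedding that keeps the distant point distant is compared against one that collapses it onto point $i$:

```python
import numpy as np

def q_from_dists(d):
    """SNE-style LD affinities for one point: q_j ∝ exp(-d_j^2), normalized."""
    w = np.exp(-d ** 2)
    return w / w.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Hypothetical HD neighbor probabilities of point i:
# two close neighbors and one far one (j = 2).
p = np.array([0.70, 0.25, 0.05])

# Embedding that respects the HD structure: the far point stays far.
q_good = q_from_dists(np.array([0.5, 1.0, 2.0]))

# Embedding that collapses the far point (j = 2) next to point i.
q_bad = q_from_dists(np.array([0.5, 1.0, 0.1]))

print(kl(p, q_good), kl(p, q_bad))
```

The collapsed embedding has a much higher cost (roughly 0.44 vs. 0.03 nats here), and the increase comes mostly from the *other* terms: the inflated $q$ for the collapsed pair drains normalized mass from the genuinely close neighbors, whose large $p$ values then penalize the now-too-small $q$.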