In the paper 'Stochastic Neighbor Embedding', the cost function in terms of the K-L divergence is
$$ C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} = \sum_i \operatorname{KL}(P_i \| Q_i). $$
Question 1: Why does the paper say that SNE is an improvement over methods like LLE or SOM, in which widely separated data points can be 'collapsed' onto nearby points?
When minimizing the cost, a small distance in the low-dimensional space always produces a large $q_{ij}$, which it seems could also cause a 'collapse'. Or does 'collapse' mean something else?
Question 2: Similarly, I do not understand how SNE keeps the images of widely separated objects relatively far apart.
Question 1
The paper discusses this point directly; paraphrasing its argument:
Ideally, to minimize the KL divergence, we should have $q_{ij} \approx p_{ij}$. The paper's point is that the cost is asymmetric with respect to which of the two is larger. Note that $p_{ij}$ decreases with distance in the high-dimensional (HD) space, while $q_{ij}$ decreases with distance in the low-dimensional (LD) embedding space.

A large $q_{ij}$ paired with a small $p_{ij}$ means modelling a large HD distance with a small LD distance. The term $p_{ij}\log(p_{ij}/q_{ij})$ itself is small, so the penalty is borne out in the other terms of the sum: because $q$ is normalized, some $q_{k\ell}$ will be too small, since part of the density of $q$ was "wasted" on $q_{ij}$.

On the other hand, modelling a small HD distance with a large LD distance means having a large $p_{ij}$ with a small $q_{ij}$, so the $(i,j)$ term itself is heavily penalized. I think what they mean is that, without this asymmetry, such collapsing is more likely in LLE and SOM.
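A small numerical sketch of this asymmetry (the probability values here are made up for illustration, not taken from the paper): for a comparable amount of mismatch, modelling a large $p_{ij}$ with a small $q_{ij}$ (close HD neighbours placed far apart) costs much more than modelling a small $p_{ij}$ with a large $q_{ij}$ (a far pair pulled closer).

```python
import numpy as np

def kl(p, q):
    """KL divergence sum_j p_j log(p_j / q_j) for strictly positive p, q."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical neighbour probabilities over 4 pairs: pair 0 is close in HD.
p = np.array([0.70, 0.10, 0.10, 0.10])

# Large p modelled by small q: the close pair is mapped far apart.
q_far = np.array([0.10, 0.30, 0.30, 0.30])

# Small p modelled by large q: a far pair is pulled closer ("collapse" direction).
q_near = np.array([0.40, 0.40, 0.10, 0.10])

print(kl(p, q_far))   # heavily penalized
print(kl(p, q_near))  # much cheaper
```

Both layouts distort the distances, but only the first one, which spends a small $q$ on a large $p$, incurs a large cost.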
Question 2
Basically re-iterating: to keep the images of widely separated objects relatively far apart, we want $q_{ij}$ to be small when $p_{ij}$ is small. If the HD distance is large, then $p_{ij}$ will be small, which makes it seem as if that term does not matter much, since it is $p_{ij}\log(p_{ij}/q_{ij})$. However, if the corresponding LD distance is small (meaning we did not keep the points far apart), then $q_{ij}$ will be large. Because $q$ is normalized, this means we have "wasted" some of its probability mass here, so the other terms will be larger as a result. Hence the method is driven to avoid this.
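To see this "wasted probability mass" effect end to end, here is a sketch with hypothetical data: SNE-style neighbour probabilities (Gaussian kernel with a fixed, arbitrary $\sigma = 1$) computed for two well-separated clusters, comparing a faithful layout with one that collapses the far cluster onto the near one.

```python
import numpy as np

def neighbour_probs(X, sigma=1.0):
    """SNE-style conditional neighbour probabilities from squared distances."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)  # a point is not its own neighbour
    return w / w.sum(axis=1, keepdims=True)

def kl_cost(P, Q):
    """C = sum_i sum_j p_ij log(p_ij / q_ij), skipping the zero diagonal."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Two well-separated clusters in the "high-dimensional" space (1-D for simplicity).
X = np.array([[0.0], [1.0], [10.0], [11.0]])
P = neighbour_probs(X)

# Faithful layout: same geometry, so Q matches P and the cost is ~0.
Q_good = neighbour_probs(np.array([[0.0], [1.0], [10.0], [11.0]]))

# Collapsed layout: the far cluster is mapped in between the near points.
Q_bad = neighbour_probs(np.array([[0.0], [1.0], [0.5], [1.5]]))

print(kl_cost(P, Q_good))  # ~0
print(kl_cost(P, Q_bad))   # much larger
```

Collapsing the far cluster inflates $q_{ij}$ for pairs whose $p_{ij}$ is tiny, and, through the normalization, shrinks the $q_{ij}$ of each point's true neighbours, so the total cost rises sharply.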