I'm trying to understand t-SNE based on this video, and everything is clear except for one key part of the logic, right at the time point linked to above: namely, minimizing the Kullback-Leibler divergence.
The stated problem is that we want to choose $ q_{ij} $ such that $$ \sum\limits_{i}\sum\limits_{j \neq i} p_{ij} \log \left( \frac{p_{ij}}{q_{ij}} \right) $$ is as small as possible, and
$$ q_{ij}=\frac{(1+|y_i-y_j|^2)^{-1}}{ \sum\limits_{k} \sum\limits_{l\neq k} \left( 1+|y_{k}-y_{l}|^2 \right)^{-1}} $$
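To see the formula in action, here is a short NumPy sketch that computes the $q_{ij}$ matrix from a set of low-dimensional points. The points `Y` are an arbitrary random example (not actual t-SNE output), chosen just to illustrate the normalization:

```python
import numpy as np

def q_matrix(Y):
    """q_ij = (1 + |y_i - y_j|^2)^(-1), normalized over all pairs i != j."""
    # Pairwise squared Euclidean distances |y_i - y_j|^2 via broadcasting.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)   # Student-t kernel weights
    np.fill_diagonal(inv, 0.0)     # exclude the j == i terms
    return inv / inv.sum()         # normalize so all q_ij sum to 1

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))        # 5 hypothetical 2-D map points
Q = q_matrix(Y)
print(Q.sum())                     # sums to 1, like a probability distribution
```

Note that the entries are non-negative and sum to one by construction, which is exactly why they can be treated as probabilities in the KL objective.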
(I'm a little unsure about those $^{-1}$ exponents in the expression, but I'm copying what's in the video.) (Edit: they're right; the quantities are weighted inversely to the squared distances.)
I understand that $q_{ij}$ is constrained to the range $[0, 1]$, but ... these aren't probabilities (Edit: they actually are referred to as "probabilities of choosing each other as neighbours"); they're just normalized distances. So why is this a sensible comparison? Does this just work for any set of numbers that are confined to $[0, 1]$ and add up to one?
(I updated this question because I missed something obvious, but I'm still stuck. Thanks for any help you can offer)
Hopefully, the following will help:
If $\{a_i\}$ and $\{b_i\}$ are sequences with $a_i \ge 0$ and $b_i > 0$, then the following holds:
\begin{equation} \sum_i a_i \log \left(\frac{a_i}{b_i}\right) \ge \left(\sum_i a_i\right)\log \left(\frac{\sum_i a_i}{\sum_i b_i}\right). \end{equation}
This inequality is called the log-sum inequality and is often encountered in information theory and high-dimensional statistics.
In your case, $q_{ij}$ is positive and sums to 1 when summed over all $i$ and $j$ with $j\ne i$, so it technically is a probability distribution. Whether the right-hand side is non-negative depends on $p_{ij}$; I am assuming the $p_{ij}$ are non-negative and not all zero. In t-SNE the $p_{ij}$ also sum to 1, so applying the inequality with $a = p$ and $b = q$ gives a right-hand side of $1 \cdot \log(1/1) = 0$: the objective is always non-negative, and it vanishes exactly when $p_{ij} = q_{ij}$ for all pairs. That is why minimizing it drives $q$ toward $p$.
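A quick numeric sanity check of this argument: with two made-up distributions that each sum to 1 (as $p$ and $q$ do in t-SNE), the left-hand side of the log-sum inequality is the KL objective, and the right-hand side collapses to $1 \cdot \log(1/1) = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(6); p /= p.sum()   # arbitrary probability vector
q = rng.random(6); q /= q.sum()   # another arbitrary probability vector

lhs = np.sum(p * np.log(p / q))                 # sum_i a_i log(a_i / b_i)
rhs = p.sum() * np.log(p.sum() / q.sum())       # = 1 * log(1/1) = 0

print(lhs, rhs)   # lhs >= rhs, and rhs is 0 because both sums are 1
```

For $p = q$ the left-hand side is exactly zero as well, matching the equality case of the inequality.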