As I understand the KL Divergence, it measures how different two probability distributions $P$ and $Q$ are.
However, say the two distributions are:
P = [0.2 0.2 0.2 0.4];
Q = [0 0.5 0.1 0.4];
Then the KL divergence $D_{KL}(P \Vert Q)$ is infinite. What is the justification for calling these distributions infinitely different?
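A quick way to verify the claim numerically, as a minimal sketch (the function name `kl_divergence` is mine, and it follows the convention that terms with $p_i = 0$ contribute zero):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits. Terms with p_i = 0 are dropped by convention."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    # Where q_i = 0 but p_i > 0, p_i/q_i is +inf, so the sum is +inf.
    with np.errstate(divide="ignore"):
        return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

P = [0.2, 0.2, 0.2, 0.4]
Q = [0.0, 0.5, 0.1, 0.4]
print(kl_divergence(P, Q))  # inf
```

The first component is the culprit: $p_1 = 0.2 > 0$ while $q_1 = 0$, so the term $p_1 \log_2(p_1/q_1)$ is infinite.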
One interpretation of the Kullback-Leibler divergence
$$D_{KL}(P \Vert Q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$
is that it measures the extra message length per character needed to send a message drawn from distribution $P$ using an encoding scheme optimized for distribution $Q$.

If a character is assigned a very small probability by $Q$, then the encoding scheme will use a very large number of bits for that character, since it is expected to occur so rarely. (By the Kraft inequality, an optimal code for $Q$ assigns about $-\log_2(q_i)$ bits to the $i$-th character.) If that character in fact occurs frequently under $P$, then a message drawn from $P$ will take a large number of bits to encode: the character contributes $-p_i \log_2(q_i)$ bits per transmitted character on average.

Therefore, if the $i$-th component of $Q$ goes to zero while the corresponding component of $P$ remains fixed and positive, the KL divergence $D_{KL}(P \Vert Q)$ should (and does) diverge logarithmically, like $-p_i \log_2(q_i) \to \infty$ as $q_i \to 0$. In the limiting case $q_i = 0$, no finite-length code for the $i$-th character exists at all, so the divergence is infinite.
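The coding-length reading above can be checked numerically. The sketch below (the helper name `extra_bits` is mine) computes $D_{KL}$ as cross-entropy minus entropy, i.e. the average extra bits paid for using $Q$'s code on $P$'s data, and shrinks $q_1$ toward zero (rescaling the rest of $Q$ so it still sums to 1) to show the divergence tracking the dominant term $-p_1 \log_2(q_1)$:

```python
import numpy as np

def extra_bits(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P): extra bits per character
    paid for encoding P-distributed symbols with Q's optimal code."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    cross_entropy = -np.sum(p * np.log2(q))  # average code length under Q's code
    entropy = -np.sum(p * np.log2(p))        # optimal average code length for P
    return cross_entropy - entropy

P = np.array([0.2, 0.2, 0.2, 0.4])
for q1 in [0.1, 0.01, 0.001, 1e-6]:
    # Shrink Q's first component toward zero; rescale the rest to keep sum(Q) = 1.
    Q = np.array([q1, 0.5 * (1 - q1), 0.1 * (1 - q1), 0.4 * (1 - q1)])
    print(f"q_1 = {q1:>7g}: D_KL = {extra_bits(P, Q):6.3f} bits, "
          f"-p_1*log2(q_1) = {-P[0] * np.log2(q1):6.3f} bits")
```

As $q_1$ shrinks, $D_{KL}(P \Vert Q)$ grows without bound, staying within a constant of the dominant term $-p_1 \log_2(q_1)$.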