Virtually all proofs I have seen of the Chernoff-Hoeffding bound (and its variations/generalizations like Azuma's inequality) are derived by taking the expectation of the moment generating function $e^{tX}$ of the sum $X=\sum_{i=1}^n X_i$, and "massaging" it. It's "almost by accident" that, if the $X_i$ are independent indicator variables with average expectation $p$ (i.e. $\frac{1}{n}\sum_{i=1}^n E[X_i]=p$), we obtain the beautiful:
$$\Pr(X\geq (1+\epsilon)E[X])~\leq~2^{-n\,H_p((1+\epsilon)p)}$$
where $H_p(x)$ is the relative entropy/information gain/Kullback–Leibler divergence of $x$ with respect to $p$ (in bits, i.e. using $\log_2$; substitute $e^{-\dots}$ in the formula above if $H_p$ is expressed in nats, i.e. using $\log_e$). This has a very intuitive interpretation: every extra bit of "surprisal" in an outcome lowers the probability of seeing that outcome by a factor of $2$.
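(As a quick numerical sanity check of this form of the bound, here is a small stdlib-only Python sketch for the i.i.d. Bernoulli case; the helper names `kl_bits` and `binom_tail` are just illustrative, and the specific values of $n$, $p$, $\epsilon$ are arbitrary.)

```python
from math import comb, log2

def kl_bits(q, p):
    """Binary KL divergence H_p(q) = D(q || p) in bits."""
    return q * log2(q / p) + (1 - q) * log2((1 - q) / (1 - p))

def binom_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Arbitrary example parameters: n coins of bias p, deviation factor (1 + eps)
n, p, eps = 100, 0.3, 0.5
q = (1 + eps) * p                  # threshold fraction (1 + eps) * p = 0.45
exact = binom_tail(n, p, int(n * q))   # exact P(X >= (1 + eps) E[X])
bound = 2 ** (-n * kl_bits(q, p))      # 2^{-n H_p((1 + eps) p)}
assert exact <= bound                  # the entropy bound holds
```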
Can one derive the Chernoff-Hoeffding bound directly through an information theoretic argument? Something along the lines "if encoding this requires $b$ bits, then it can't occur with probability greater than $2^{-b}$"? Shouldn't this easily cover the case when the $X_i$ are dependent, too?