This is claimed in Proposition 1 in the paper http://arxiv.org/abs/1409.6110
Let $A$ be a $n \times d$ matrix. $A$ can have only $K$ different types of rows i.e. rows of $A$ are chosen from a set of $K$ vectors in $\mathbb{R}^d$. $A$ is not random.
Consider the model $$r = A\theta^* + e$$ where each $e_i$ is iid, mean 0, and bounded in $[-\sigma,\sigma]$. Let $\hat{\theta}$ be the least squares estimate of $\theta^*$ i.e. $$\hat{\theta} = \arg \min_\theta \|A\theta-r\|^2$$ or $\hat{\theta} = (A^T A)^{-1} A^T r$. The result claims that for $x$ chosen from the set of $K$ vectors, we have for any $n \in \mathbb{N}$ $$P\left(\left| x^T \theta^* - x^T \hat{\theta} \right| \leq c \|x\|_{(A^T A)^{-1}}\sqrt{\log (c'n^2 K/\delta)}\right) \geq 1-\delta$$ where $c = 2\sigma\sqrt{2}$ and $c'=6/\pi^2$.
Any idea how I can prove this? Apparently it can be proved using Azuma's inequality. I don't even know where to start, I would appreciate any ideas/hints.