Given a probability distribution $x\mapsto p_X(x)$, an integer $n>0$, and some $\epsilon>0$, one defines the $\epsilon$-typical sequences as those sequences $\boldsymbol{x}\equiv(x_1,...,x_n)$ satisfying $$\left|-\frac1 n \log[p(\boldsymbol{x})] - H(X) \right| < \epsilon,$$ where $p(\boldsymbol{x})\equiv \prod_{k=1}^n p_X(x_k)$.
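For concreteness, here is a minimal Python sketch of this definition for an i.i.d. binary source (not from any particular reference; the name `is_eps_typical` and its interface are just chosen here for illustration):

```python
import math

def is_eps_typical(x, p, eps):
    """Weak (entropy) typicality check for an i.i.d. binary source
    with P(X = 1) = p: is |-(1/n) log2 p(x) - H(X)| < eps?"""
    n = len(x)
    # Source entropy H(X) in bits (assumes 0 < p < 1).
    H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    # log2 of the sequence probability p(x) = prod_k p_X(x_k).
    log_px = sum(math.log2(p) if xk == 1 else math.log2(1 - p) for xk in x)
    return abs(-log_px / n - H) < eps
```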
My rough intuition for typical sequences was that they are the ones containing $np_i$ copies of the $i$-th symbol, for each $i$, and this intuition was backed up by the fact that, for a binary source, $$\frac1n \log\left[\binom{n}{np}p^{np} (1-p)^{n(1-p)}\right] \to 0, \qquad n\to\infty, \\ p^{np} (1-p)^{n(1-p)} = 2^{-nH(X)},$$ i.e. the roughly $2^{nH(X)}$ sequences with $np$ ones carry essentially all of the probability, to first order in the exponent. However, I recently realised that, for binary sequences with $p_0=p_1=1/2$, the definition makes every sequence typical, regardless of $\epsilon$ and $n$, since every sequence then has probability $2^{-n}=2^{-n H(X)}$.
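One can check this observation directly by brute force for small $n$ (a quick sketch reusing the hypothetical `is_eps_typical` above):

```python
import itertools

n, eps = 10, 0.1
for p in (0.5, 0.25):
    count = sum(is_eps_typical(x, p, eps)
                for x in itertools.product((0, 1), repeat=n))
    print(f"p = {p}: {count} of {2 ** n} sequences are eps-typical")
# For p = 0.5 every sequence has probability exactly 2^{-n} = 2^{-nH(X)},
# so all 2^n sequences pass; for p = 0.25 only a fraction of them do.
```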
So what gives? Is the intuition that typical sequences are those with $np_i$ copies of each symbol tenable at all? Does it only fail for balanced probability distributions, or are there other "edge cases" to be mindful of?
In the maximal entropy case [$p=1/2$ for binary] all sequences are typical; this is a feature of the definition.
If $p$ is far from $1/2$, say in $(0,1/4)$, then there are far fewer typical sequences and you get compressibility: with high probability you only have to encode about $2^{nH(p)}$ sequences, while the rest occur with small total probability, $\epsilon$ say, so you can just mark them with a header symbol and not encode them. This makes little difference to the expected length of your source code.
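As a rough illustration of this coding argument (a sketch under my own assumptions, not the construction from any particular textbook): flag typical sequences with one header bit and index them with roughly $nH(p)$ bits, and send atypical sequences verbatim behind the other header value. It reuses the hypothetical `is_eps_typical` from above:

```python
import itertools
import math

def expected_rate(n, p, eps):
    """Expected length in bits/symbol of a toy two-part code:
    1 header bit + an index into the eps-typical set for typical
    sequences, 1 header bit + n raw bits for everything else."""
    seqs = list(itertools.product((0, 1), repeat=n))
    typical = {x for x in seqs if is_eps_typical(x, p, eps)}
    # |typical| <= 2^{n(H + eps)}, so the index needs at most roughly
    # n(H + eps) bits (assumes the typical set is nonempty).
    index_bits = math.ceil(math.log2(len(typical)))
    total = 0.0
    for x in seqs:
        prob = math.prod(p if xk == 1 else 1 - p for xk in x)
        total += prob * (1 + (index_bits if x in typical else n))
    return total / n

# As n grows, the probability of the atypical set vanishes, so the expected
# rate approaches roughly H(p) bits/symbol (within eps) instead of 1.
```

The brute-force enumeration is only feasible for small $n$, but it makes the accounting in the argument above concrete.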
And yes, your intuition that typical sequences contain roughly $np_i$ copies of each symbol is correct in that sense.
However, you can prove stronger results using strong typicality.
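For comparison, here is a sketch of a strong (letter) typicality check, i.e. requiring the empirical frequency of every symbol to be close to its probability; the exact thresholds vary between textbooks, so treat the condition below as one illustrative choice rather than the definition from Cover and Thomas:

```python
def is_strongly_typical(x, probs, eps):
    """One common form of strong typicality: for every symbol a, the
    empirical frequency of a in x is within eps/|alphabet| of probs[a],
    and symbols with probability zero do not occur at all."""
    n = len(x)
    for a, pa in probs.items():
        freq = x.count(a) / n
        if pa == 0 and freq > 0:
            return False
        if abs(freq - pa) >= eps / len(probs):
            return False
    return True

# For p0 = p1 = 1/2 this is genuinely restrictive: only sequences whose
# fraction of ones is within eps/2 of 1/2 pass, whereas every binary
# sequence is weakly typical.
```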
Edit: See, for example, Section 10.6 of Cover and Thomas' Elements of Information Theory (2nd ed.), entitled "Strongly Typical Sequences and Rate Distortion". Essentially, tight two-sided (both lower and upper) bounds can be proved there for the quantities of interest. I copied one result below: