Let's say that we use a third-order approximation of the English language for a block of text, and that the average entropy is 2.5 bits.
By Shannon's source coding theorem, doesn't that mean that, on average, we need only 2.5 bits to encode each character?
To my knowledge, current compression methods don't factor in the dependencies between successive characters (for example, a 'q' being followed by a 'u' occurs with probability close to 1, so the 'u' should ideally take up almost no space in the encoding, since it carries effectively 0 bits of information).
What would such an encoding mechanism look like? That is, how would an encoder that uses the previous characters to encode subsequent characters be implemented?
Yes, they do. In several ways.
With Huffman coding, you can code the $n$-th extension of the source. For example, for $n=2$, you treat each pair of symbols as a new macro-symbol and compute the pair probabilities (in your example, the pair "qu" would have a significant probability, while the pair "qs" would have essentially zero probability), then build the corresponding Huffman tree over those pairs.
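A minimal sketch of the second extension in Python (the helper `huffman_code` and the sample text are my own, not a standard library API): the text is split into non-overlapping pairs, and a single Huffman code is built over those pairs, so frequent pairs like "qu" get short codewords.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code from a {symbol: count} map; returns {symbol: bitstring}."""
    if len(freqs) == 1:  # degenerate case: one symbol still needs one bit
        return {s: "0" for s in freqs}
    # Heap entries: (weight, tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0/1.
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "the queen quietly queued"
# Second extension: treat non-overlapping pairs as macro-symbols.
pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]
codes = huffman_code(Counter(pairs))
encoded = "".join(codes[p] for p in pairs)
```

Since "qu" occurs three times in this toy text while every other pair occurs once, it ends up with one of the shortest codewords.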
Or you can compute the conditional probabilities and use a bank of Huffman coders, choosing a coder based on the previous symbol.
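A sketch of the coder-bank idea, assuming the same kind of toy `huffman` helper as above and a hypothetical `"^"` start-of-text marker: one Huffman code is built per context (previous symbol), and both encoder and decoder switch tables as they go.

```python
import heapq
from collections import Counter, defaultdict

def huffman(freqs):
    """Huffman code for a {symbol: count} map; returns {symbol: bitstring}."""
    if len(freqs) == 1:
        return {s: "0" for s in freqs}
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

text = "the queen quietly queued the quiet queue"
# Count next-symbol frequencies conditioned on the previous symbol.
cond = defaultdict(Counter)
prev = "^"  # hypothetical start-of-text context marker
for ch in text:
    cond[prev][ch] += 1
    prev = ch

# The bank: one Huffman coder per context.
bank = {ctx: huffman(counts) for ctx, counts in cond.items()}

# Encode: the previous symbol selects which coder to use.
bits, prev = [], "^"
for ch in text:
    bits.append(bank[prev][ch])
    prev = ch
encoded = "".join(bits)
```

Note the limitation: after a 'q' the only observed successor is 'u', yet Huffman must still spend a whole bit on it, since no codeword can be shorter than one bit. This is exactly where arithmetic coding does better.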
Or you can use arithmetic coding, which adapts naturally to conditional probabilities (and not only to first-order dependencies).
Or you can use a universal compressor such as Lempel-Ziv (LZ77, LZ78, or one of their variants), which does not use explicit probabilities but instead exploits repeating patterns.
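A compact LZ78 sketch (toy version, no bit-level packing): the input is parsed into phrases never seen before, each emitted as a (dictionary index of the longest known prefix, next character) pair, so repeated substrings are replaced by references.

```python
def lz78_encode(text):
    """LZ78: parse text into novel phrases; emit (prefix index, new char) pairs."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending a known phrase
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:  # leftover phrase is already in the dictionary
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs

def lz78_decode(pairs):
    phrases = [""]
    out = []
    for idx, ch in pairs:
        s = phrases[idx] + ch  # rebuild the phrase from its prefix index
        out.append(s)
        phrases.append(s)
    return "".join(out)
```

No probability model appears anywhere; the dictionary of phrases implicitly captures the statistics, which is why LZ schemes are asymptotically optimal for stationary ergodic sources.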