I have been reading whatever sources I could get my hands on regarding this problem. Most notes online about rate-distortion theory follow the book Elements of Information Theory by Thomas M. Cover and Joy A. Thomas. The book is well regarded, so I assume I am misunderstanding something.
Take, for example, the following slides. I am confused by the notation, which comes from the book; look at slides 14 and 16 of the linked slide deck. First, on slide 14 they denote a sequence of random variables (i.e. a sequence of bits, as far as I understand) by $X^n$, $$ X^n = (X_{1}, \ldots, X_{n}), \quad X_{i} \sim p(x), $$ and the codeword for that sequence by $\hat{X}^n$. Moving on to slide 16, they seem to mix $X^n$ and $x^n$: they define the distortion between sequences as $$ d\left(x^{n}, \hat{x}^{n}\right)=\frac{1}{n} \sum_{i=1}^{n} d\left(x_{i}, \hat{x}_{i}\right) $$
And distortion for a $\left(2^{n R}, n\right)$ code: $$ D=E d\left(X^{n}, g_{n}\left(f_{n}\left(X^{n}\right)\right)\right)=\sum_{x^{n}} p\left(x^{n}\right) d\left(x^{n}, g_{n}\left(f_{n}\left(x^{n}\right)\right)\right) $$
but as far as I can tell they are the same thing. What is the idea behind this distinction, if any? Am I missing something? This notation is consistent across all the sources I could find.
$X$ is a random variable whilst $x$ is an observation. In the same way, $X^n$ is a joint random variable that consists of $n$ i.i.d. random variables whilst $x^n$ is an observation (or realisation) of the random variable $X^n$.
For instance, let $X$ be a random variable that represents a coin toss (you can for instance assume heads = $1$ and tails = $0$). In this case $X \sim \text{Bern}(1/2)$. If you observe $n$ i.i.d. variables at the same time then you have the random variable $X^n$. If you sample $X^n$ then you get an $n$-bit vector, which is denoted $x^n$. In our example, there are $2^n$ different values that $x^n$ can take.
So $X^n$ is the joint random variable $(X_1, X_2, ..., X_n)$ whilst $x^n = x_1x_2...x_n$ is a realisation.
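To make the distinction concrete, here is a minimal Python sketch of the coin-toss example: the list comprehension plays the role of sampling $X^n$ once, and the resulting list is one particular realisation $x^n$ (the variable names are mine, not from the slides):

```python
import random

random.seed(0)

n = 5

# X^n is the joint random variable (X_1, ..., X_n), each X_i ~ Bern(1/2).
# Sampling it once yields a single realisation x^n: an n-bit vector.
x_n = [random.randint(0, 1) for _ in range(n)]

print(x_n)      # one particular n-bit realisation of X^n
print(2 ** n)   # number of distinct values x^n can take: 2^5 = 32
```

Running the sampling line again would (in general) give a different $x^n$; the random variable $X^n$ is the sampling procedure itself, not any one output.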
Let $x^n$ and $\hat{x}^n$ be two different $n$-bit vectors, where $\hat{x}^n = \phi(x^n) $ for some arbitrary function $\phi(\cdot)$, then the distortion due to the application of $\phi(\cdot)$ is $d(x^n, \phi(x^n)) = d(x^n, \hat{x}^n)$, where $d$ is some chosen distortion function.
The distortion between sequences is defined symbol by symbol: $d(x^n, \hat{x}^n) = \sum _{i} d(x_i, \hat{x}_i)$ gives the total distortion over the $n$ symbols. (Note this additivity is part of the definition of the sequence distortion, not a consequence of the $x_i$ being independent.) If the distortion needs to be calculated on a 'per bit' basis then you can just divide by $n$, i.e. $d(x^n, \hat{x}^n) = \frac{1}{n} \sum _i d(x_i, \hat{x}_i)$, which is the convention used on the slides.
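As a concrete instance, here is a sketch using Hamming distortion ($d(x_i, \hat{x}_i) = 1$ if the bits differ, $0$ otherwise), which is the standard choice for binary sources; the function name and the `per_bit` flag are my own:

```python
def hamming_distortion(x_n, xhat_n, per_bit=False):
    """Distortion between two equal-length bit sequences under the
    Hamming distortion d(x_i, xhat_i) = 1 if x_i != xhat_i else 0."""
    total = sum(int(a != b) for a, b in zip(x_n, xhat_n))
    return total / len(x_n) if per_bit else total

x_n    = [0, 1, 1, 0, 1]
xhat_n = [0, 1, 0, 0, 0]

print(hamming_distortion(x_n, xhat_n))                # 2 mismatched bits
print(hamming_distortion(x_n, xhat_n, per_bit=True))  # 2/5 = 0.4 per bit
```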
If we want to find the expected distortion caused by the application of $\phi(\cdot)$ to $X^n$ then we can compute
\begin{align} \mathbb{E}[d(X^n , \phi(X^n))] = \sum_{x^n \in \mathcal{X}^n} p({x^n}) d(x^n , \phi(x^n)) \end{align}
$\mathcal{X}^n$ is the set of the $2^n$ possible values of $x^n$. The distortion of a code is defined as $\mathbb{E}[d(X^n , \phi(X^n))]$ where $\phi(\cdot) = g_n(f_n(\cdot))$. Here, $f_n(\cdot)$ is the encoder and $g_n(\cdot)$ is the decoder.
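The expected distortion can be computed by brute force for a small toy code. The sketch below invents a two-codeword code for illustration (so it is not from the slides): the encoder $f_n$ sends only the majority bit, the decoder $g_n$ repeats it $n$ times, and the sum runs over all $2^n$ equally likely realisations $x^n$ of a fair-coin source:

```python
from itertools import product

n = 3

# Toy code with 2^{nR} = 2 codewords (so R = 1/n); purely illustrative.
def f_n(x_n):
    return int(sum(x_n) > n / 2)   # encoder: one-bit index (majority bit)

def g_n(index):
    return (index,) * n            # decoder: reproduction sequence xhat^n

def d(x_n, xhat_n):
    # per-bit Hamming distortion, averaged over the n symbols
    return sum(int(a != b) for a, b in zip(x_n, xhat_n)) / n

# D = sum over all x^n in X^n of p(x^n) d(x^n, g_n(f_n(x^n)));
# for a fair coin every one of the 2^n sequences has probability 2^{-n}.
D = sum((0.5 ** n) * d(x_n, g_n(f_n(x_n)))
        for x_n in product((0, 1), repeat=n))
print(D)   # 0.25 for this particular code with n = 3
```

Note that the expectation is over the random variable $X^n$ (capital), while each term in the sum plugs in a concrete realisation $x^n$ (lowercase), which is exactly the notational split the question asks about.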