I am reading Elements of Information Theory and I have come across this formula: \begin{align} R^{(I)}(D) = \min_{p(\hat {x}|x):\sum_{(x,\hat x)}p(x)p(\hat x|x)d(x,\hat x) \leq D}I(X; \hat{X}) \end{align}
What is the intuition behind choosing $p(\hat x|x)$ to minimize that expression? For channel capacity we used conditional probabilities to model the fact that the channel can corrupt certain symbols with some probability. What exactly are we modelling here with this probability distribution? It seems to me that we have two functions, an encoder and a decoder, that are both deterministic, so I don't see where the probabilities come from. $p(x)$ makes sense, but not the conditional.
It appears that you think that $\hat{X}$ should be an explicit function of $X$, say $\hat{X} = g(X)$ for some deterministic function $g(\cdot)$. In that case, indeed, it would seem more natural to optimize over $g(\cdot)$, since the distribution $p(\hat{x}\mid x)$ is then, trivially, the point mass at $g(x)$, i.e., $p(\hat{x}\mid x)=\mathbb{1}\{\hat{x}=g(x)\}$ for discrete alphabets. However, note that optimizing over $p(\hat{x}\mid x)$ also covers this case, i.e., the definition is more general.
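To see what the minimization over $p(\hat{x}\mid x)$ produces concretely, here is a short numerical sketch using the Blahut–Arimoto algorithm (the function name and parameters are illustrative, not from the book). For a Bernoulli(1/2) source with Hamming distortion, the computed $(D, R)$ pair should land on the known curve $R(D) = 1 - H_b(D)$:

```python
import numpy as np

def rate_distortion_point(p_x, d, s, n_iter=500):
    """Blahut-Arimoto sketch: one point on the R(D) curve for slope s > 0.

    p_x : source distribution p(x), shape (nx,)
    d   : distortion matrix d(x, xhat), shape (nx, nxhat)
    s   : Lagrange multiplier (larger s -> smaller distortion)
    Returns (D, R) with R in bits.
    """
    nx, nxhat = d.shape
    Q = np.full((nx, nxhat), 1.0 / nxhat)   # test channel p(xhat|x), start uniform
    for _ in range(n_iter):
        q = p_x @ Q                          # output marginal p(xhat)
        A = q * np.exp(-s * d)               # unnormalized update
        Q = A / A.sum(axis=1, keepdims=True)
    q = p_x @ Q
    D = float(np.sum(p_x[:, None] * Q * d))                  # average distortion
    R = float(np.sum(p_x[:, None] * Q * np.log2(Q / q)))     # I(X; Xhat) in bits
    return D, R
```

For the symmetric binary source the optimal test channel is a BSC, and the returned point satisfies $R = 1 - H_b(D)$, which is a quick sanity check on the formula above.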
Now, what you should understand is that the optimal encoder is not necessarily of the form $\hat{X} = g(X)$ for an explicitly constructed $g$, as discussed above. In the general case, the optimal encoder operates as follows: a codebook of $2^{nR}$ reconstruction sequences is generated offline, each drawn i.i.d. from the output marginal $p(\hat{x}) = \sum_x p(x)\,p(\hat{x}\mid x)$ induced by the minimizing conditional distribution, and the encoder maps each source sequence $X^n$ to a codeword in this codebook that is jointly typical with it, which guarantees distortion close to $D$.
Note that this does not mean that the encoder is random in the sense that, for the same $X$, it generates a random $\hat{X}$! It means that for the same $X$ it always generates the same $\hat{X}$ (known to both encoder and decoder); however, the value of $\hat{X}$ that is used was obtained by generating (offline) a single realization of a random variable.
See the "Gaussian source" example in Elements of Information Theory for a source where this approach is indeed optimal.
This procedure exactly parallels the channel-capacity scenario: there, too, the set of codewords (the "codebook") is fixed during transmission, yet the codewords themselves were selected offline as a (single) realization of a random variable.
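The offline/online split can be sketched in a few lines. This is a toy illustration, not the book's construction: it assumes a Bernoulli(1/2) source with Hamming distortion, and the helper name and the parameters $n$, $R$ are made up. The codebook is drawn once, at random, offline; the encoder itself is then fully deterministic (it picks the minimum-distortion codeword):

```python
import numpy as np

def random_codebook_distortion(n=20, R=0.3, n_trials=200, seed=0):
    """Toy random-coding sketch: Bernoulli(1/2) source, Hamming distortion.

    Offline: draw a codebook of 2^{nR} reconstruction words i.i.d. from the
    output marginal (uniform here) -- a single realization of a random variable.
    Online: the encoder deterministically maps each source word to the
    codeword with minimum distortion. Returns the average per-letter distortion.
    """
    rng = np.random.default_rng(seed)
    M = max(1, int(2 ** (n * R)))                # codebook size 2^{nR}
    codebook = rng.integers(0, 2, size=(M, n))   # drawn ONCE, offline
    total = 0.0
    for _ in range(n_trials):
        x = rng.integers(0, 2, size=n)           # source word
        dists = (codebook != x).mean(axis=1)     # Hamming distortion to each codeword
        total += dists.min()                     # deterministic encoding step
    return total / n_trials
```

At rate $R = 0.3$ the rate-distortion bound gives $D \approx 0.19$; with this small block length the achieved distortion is somewhat higher, but already well below the $0.5$ that any single fixed reconstruction word would give, which is the point of the random-codebook construction.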