Say I have a variable $x$ whose PDF is a von Mises distribution with a single parameter, $\kappa$, that determines the distribution's concentration: $VM(x; \kappa)$.
But suppose $\kappa$ itself is not fixed, but drawn from a Gamma distribution: $G(\kappa; \bar{\kappa}, \tau)$, where the mean is $\bar{\kappa}$ and the variance is $\bar{\kappa}\tau$.
An author whose work I'm replicating has this to say about this situation: "The predictions [for each $x$ value's probability] have to be averaged over all possible values of precision $\kappa$ (mathematically: we need to 'marginalize out $\kappa$'). We do this by discretizing the (gamma) distribution over $\kappa$ into 50 bins with equal probability masses, computing the model [von Mises] prediction at each bin center, and then averaging the predictions. This solution is accurate as long as the number of bins is not too small (in the limit of an infinite amount of bins, the solution is equal to an analytical marginalization)" (emphasis mine).
I've read elsewhere that, in marginalization, probabilities should be summed, not averaged. Another point of confusion is the author's statement that the average is taken "over all possible values of $\kappa$." This seems to imply that each possible value of $\kappa$ is weighted equally -- but then we're talking about a uniform distribution, not a Gamma one. Clearly, the author was using language loosely.
Can someone shed some light on this, maybe with language that is more precise?
This is basically a hierarchical model. Let $X | \theta$ be some random variable that follows some parametric distribution with parameter $\theta$, where $\theta$ itself follows some other parametric distribution with hyperparameter $\lambda$. (Note that these quantities may be vector-valued to handle the case where the distributions described have more than one parameter/hyperparameter).
Then the marginal distribution of $X$ is given by $$f_X(x) = \int_{\theta \in \Omega} f_{X \mid \theta}(x | \theta) f_{\theta}(\theta) \, d\theta,$$ where $\Omega$ is the support of the distribution of $\theta$. This is in some sense a weighted average of the conditional density of $X$ given $\theta$, weighted by the density of $\theta$ over its support.
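As a concrete sketch of this integral for the von Mises/Gamma setup in the question (the hyperparameter values here are assumed purely for illustration), the marginal density can be computed to high accuracy with adaptive quadrature; this serves as a reference against which any discretization can be checked:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical hyperparameters, assumed for illustration only.
kappa_bar, tau = 8.0, 2.0
# Gamma shape/scale giving mean kappa_bar and variance kappa_bar * tau.
a, scale = kappa_bar / tau, tau

def marginal_pdf(x):
    """f_X(x): integral of VM(x; kappa) * Gamma(kappa) over kappa in (0, inf)."""
    integrand = lambda k: stats.vonmises.pdf(x, k) * stats.gamma.pdf(k, a, scale=scale)
    value, _ = quad(integrand, 0, np.inf)
    return value
```

Since each conditional von Mises density integrates to 1 over $(-\pi, \pi]$, so does the marginal; that normalization is a useful sanity check on any discretized version.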
There are several ways to perform a numerical approximation by discretization of this integral.
Method 1
One way is to lay a regularly spaced grid $\{\theta_k\}_{k=1}^K$ over $\Omega$ (truncating the support if it is unbounded), evaluate the density of $\theta$ at each grid point, and use these values as the weights in a discrete sum $$f_X(x) \approx \sum_{k = 1}^K f_{X \mid \theta}(x \mid \theta_k) f_\theta(\theta_k) \Delta,$$ where $\Delta = \theta_{k+1} - \theta_k$ is the (constant) spacing of the grid. The truncation is harmless as long as the density of $\theta$ has negligible mass beyond the cutoff.
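A minimal sketch of this grid-based scheme for the question's von Mises/Gamma case (hyperparameter values are assumptions for illustration; the upper cutoff is chosen to leave negligible Gamma mass beyond it):

```python
import numpy as np
from scipy import stats

kappa_bar, tau = 8.0, 2.0            # assumed hyperparameters
a, scale = kappa_bar / tau, tau      # Gamma shape/scale: mean kappa_bar, variance kappa_bar*tau

K = 2000
# Regular grid over a truncated version of the support (0, inf).
upper = stats.gamma.ppf(1 - 1e-9, a, scale=scale)
kappa_grid = np.linspace(1e-6, upper, K)
delta = kappa_grid[1] - kappa_grid[0]
weights = stats.gamma.pdf(kappa_grid, a, scale=scale) * delta  # density * spacing

def marginal_pdf_grid(x):
    # Riemann-sum approximation of the marginalization integral.
    return np.sum(stats.vonmises.pdf(x, kappa_grid) * weights)
```

Note that the weights sum to (approximately) 1, so the discrete sum is itself a weighted average of conditional densities, never a plain sum of probabilities.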
Method 2
Another way would be to choose a variable spacing; e.g., $$\theta_k = \psi(k),$$ for some function $\psi$ related to the density in such a way that $$\Pr\left[\frac{\psi(k-1) + \psi(k)}{2} < \theta \le \frac{\psi(k) + \psi(k+1)}{2}\right] = \text{constant}, \quad k = 1, 2, \ldots.$$ If you do this, then weighting by the density is no longer needed: each point $\theta_k$ carries the same probability mass, so you simply average the conditional densities $f_{X \mid \theta}(x \mid \theta_k)$ over the points.
Method 3
Finally, the method proposed in the question itself is to fix a positive integer $N$ and compute a sequence $\{\varphi(k)\}_{k=0}^N$, with $\varphi(0)$ and $\varphi(N)$ equal to the endpoints of the support of $\theta$ (possibly infinite; for a Gamma distribution, $\varphi(0) = 0$ and $\varphi(N) = \infty$), such that $$\Pr[\varphi(k-1) < \theta \le \varphi(k)] = 1/N$$ for all $k = 1, 2, \ldots, N$; in other words, $\varphi(k)$ is the $k/N$ quantile of the distribution of $\theta$. Then define the bin centers $\theta_k = (\varphi(k-1) + \varphi(k))/2$ and evaluate the conditional density at these points. (Like the previous method, this is problematic when an endpoint of the support is infinite, but that can be addressed by using a large $N$ and ignoring the extreme bins, or by replacing each bin's midpoint with its median, the $(k - \tfrac{1}{2})/N$ quantile.)

For this last method, as for the second one, weighting by the density is not needed: each bin carries probability mass exactly $1/N$, so you simply average the $N$ conditional densities. This is precisely the author's 50-bin procedure, and it resolves the confusion in the question: the Gamma distribution is not being treated as uniform, because its weighting is built into the uneven placement of the bin centers, and that is what licenses the plain average.
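A sketch of this equal-mass-bin recipe (essentially the 50-bin procedure quoted in the question), again with assumed hyperparameter values. Because the last quantile of the Gamma distribution is infinite, the sketch takes each bin's median, the $(k - \tfrac{1}{2})/N$ quantile, rather than the arithmetic midpoint of its edges; this is one common way to sidestep the problem at infinity:

```python
import numpy as np
from scipy import stats

kappa_bar, tau = 8.0, 2.0          # assumed hyperparameters
a, scale = kappa_bar / tau, tau    # Gamma shape/scale: mean kappa_bar, variance kappa_bar*tau
N = 50                             # number of equal-probability-mass bins

# Representative point for the k-th bin: the (k - 1/2)/N quantile of the Gamma
# distribution (the bin's median, used instead of the midpoint of its edges,
# whose last value would be infinite).
centers = stats.gamma.ppf((np.arange(N) + 0.5) / N, a, scale=scale)

def marginal_pdf_bins(x):
    # Each bin carries probability mass 1/N, so no density weights are needed:
    # a plain average of the conditional densities approximates the integral.
    return np.mean(stats.vonmises.pdf(x, centers))
```

The Gamma weighting has not disappeared here: it is encoded in the uneven placement of `centers` (dense where the Gamma density is high, sparse in its tails), which is why the unweighted `np.mean` is correct even though the distribution over $\kappa$ is not uniform.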