I am looking to estimate the conditional distribution of the next observation $x_{t+1} \in \mathbb{R}_+$ of a discrete-time process, given the current observation and the $l$ previous observations. I am unsure whether the term conditional random field (CRF) is appropriate here, but the model follows the literature on hidden CRFs, taking the hidden state to be a continuous random variable $h \in \mathbb{R}^d$:
$$ p(x_{t+1} | x_{t-l:t}) = \int dh_{t-l:t+1} \ p(x_{t+1}, h_{t-l:t+1} | x_{t-l:t}) $$
using a real-valued parametric potential $\psi$ that defines the graphical model
$$ p(x_{t+1}, h_{t-l:t+1} \mid x_{t-l:t}) \propto e^{\psi(x_{t+1}, h_{t-l:t+1}, x_{t-l:t})} $$
Normalising by integrating over all $x_{t+1}$ and $h_{t-l:t+1}$, we finally have
$$ p(x_{t+1} | x_{t-l:t}) = \frac{\int dh_{t-l:t+1} e^{\psi(x_{t+1}, h_{t-l:t+1}, x_{t-l:t})} }{\int dh'_{t-l:t+1} \int dx'_{t+1} e^{\psi(x'_{t+1}, h'_{t-l:t+1}, x_{t-l:t})} } $$
For large enough $d$, direct integration is infeasible computationally, and the first thought that came to me was to use importance sampling, which also allows me to include some prior knowledge/constraints. In particular, for the hidden state, it seems reasonable that it would not differ significantly between two consecutive timesteps, so for the integral over $h$,
$$ \int dh \ e^{\psi} = \mathop{\mathbb{E}}_{h \sim q(h)} \left[ \frac{e^{\psi}}{q(h)} \right] \simeq \frac{1}{N_h} \sum_{i=1}^{N_h} \frac{e^{\psi(h^{(i)})}}{q(h^{(i)})}, \qquad h^{(i)} \sim q, $$
where $\psi(h^{(i)})$ is shorthand for $\psi$ evaluated at the $i$-th sampled sequence with the other arguments held fixed,
and with each sequence $h_{t-l:t+1}$ sampled as follows: $h_{t-l} \sim \mathcal{U}(-1, 1)$ (each element of the vector sampled independently from a uniform distribution), and then $h_{s+1} \mid h_s \sim \mathcal{N}(h_s, 0.1 \, I_d)$ for $s = t-l, \dots, t$, where $I_d$ is the $d \times d$ identity matrix.
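For concreteness, here is a minimal NumPy sketch of that proposal (all names are my own placeholders; `seq_len` would be $l+2$ here). It also accumulates $\log q(h)$ per sample, since the importance weights need it:

```python
import numpy as np

def sample_h_sequences(n_samples, seq_len, d, sigma2=0.1, rng=None):
    """Sample sequences h_{t-l:t+1} from the proposal: uniform initial state,
    then a Gaussian random walk. Returns samples of shape
    (n_samples, seq_len, d) and log q(h) for each sample."""
    rng = np.random.default_rng() if rng is None else rng
    h = np.empty((n_samples, seq_len, d))
    h[:, 0] = rng.uniform(-1.0, 1.0, size=(n_samples, d))
    # log-density of the uniform initial state: each coordinate has density 1/2
    log_q = np.full(n_samples, -d * np.log(2.0))
    for s in range(1, seq_len):
        noise = rng.normal(scale=np.sqrt(sigma2), size=(n_samples, d))
        h[:, s] = h[:, s - 1] + noise
        # accumulate the transition log-density N(h_s; h_{s-1}, sigma2 * I_d)
        log_q += (-0.5 * np.sum(noise**2, axis=1) / sigma2
                  - 0.5 * d * np.log(2.0 * np.pi * sigma2))
    return h, log_q
```

The same batch of sequences can then be fed to a vectorised $\psi$ to form the importance weights.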
The problem is that, for a fixed sequence $x_{t-l:t}$ and a fixed $x_{t+1}$, with $N_h = 1000$ samples of $h$ and $d = 10$, the standard deviation of the estimate of $p(x_{t+1} | x_{t-l:t})$ over 100 independent computations is of order $10^5$. Ultimately I want to optimise the conditional density directly via its gradient with respect to the parameters of $\psi$, but consecutive evaluations of the conditional density sometimes differ by many orders of magnitude.
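One thing that helps diagnose (and partly tame) this dynamic range is to keep the importance weights in log-space and to monitor the effective sample size, which collapses towards 1 when a single weight dominates. A sketch, assuming `psi_vals` holds the $\psi$ evaluations for the sampled sequences and `log_q` their proposal log-densities (both hypothetical names):

```python
import numpy as np

def log_is_estimate(psi_vals, log_q):
    """log of (1/N) * sum_i exp(psi_i - log_q_i), via a stable log-sum-exp."""
    log_w = psi_vals - log_q          # log importance weights
    m = np.max(log_w)                 # subtract the max to avoid overflow
    return m + np.log(np.mean(np.exp(log_w - m)))

def effective_sample_size(psi_vals, log_q):
    """ESS = (sum_i w_i)^2 / sum_i w_i^2; close to 1 means weight collapse."""
    log_w = psi_vals - log_q
    w = np.exp(log_w - np.max(log_w))  # rescaling cancels in the ratio
    return w.sum() ** 2 / np.sum(w ** 2)
```

If the ESS is a tiny fraction of $N_h$, the variance you see is a property of the proposal-target mismatch rather than of $N_h$ being too small.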
Even with the computation of $\psi$ vectorised over the sampled sequences $h_{t-l:t+1}$, the bottleneck is that each evaluation requires sampling $N_h$ sequences for the numerator and another $N_h$ for the denominator.
Any advice on reducing this bottleneck, or on adjustments to the model or alternative models altogether, would be much appreciated.