I posted a variation of this question on Cross Validated but did not get an answer, so I hope someone here can help me.
A bit of background first. I have implemented a neural network for time series forecasting. The network outputs the parameters (mean $\mu$ and dispersion $\theta$) of a negative binomial distribution
$$\Pr(X = x) = \binom{x+\theta-1}{x} \left(\frac{\theta}{\theta + \mu}\right)^\theta \left(\frac{\mu}{\theta + \mu}\right)^x$$
To ease model training, I want to scale the input data (i.e., divide the past time steps fed to the network by $k$) and then remove the scaling effect from the predicted distribution parameters. If my network were outputting the mean and variance of a Gaussian distribution, I would multiply the predicted mean and variance by $k$ and $k^2$, respectively. However, I am not sure how to do this for a negative binomial.
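To make the Gaussian case concrete, here is a quick simulation check (the numbers are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10.0  # scaling factor applied to the inputs

# Suppose the network predicts (mu_s, var_s) in the scaled space.
mu_s, var_s = 2.0, 0.5

# Sampling in the scaled space and multiplying the samples by k
# should agree with the rescaled parameters (k * mu_s, k**2 * var_s).
samples = k * rng.normal(mu_s, np.sqrt(var_s), size=1_000_000)

print(samples.mean())  # close to k * mu_s = 20
print(samples.var())   # close to k**2 * var_s = 50
```

This is exactly the kind of "undo the scaling on the parameters" rule I am looking for in the negative binomial case.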
After running some experiments, the approach I came up with (see the Cross Validated question) does not seem right. In the DeepAR paper (p. 5), the authors multiply the predicted mean by $k$ and the predicted dispersion by $\sqrt{k}$. But I do not understand where the $\sqrt{k}$ factor for the dispersion comes from.
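For reference, this is how I compared the rescalings by their implied moments (hypothetical numbers; `nb_moments` is just the textbook mean/variance of this $(\mu, \theta)$ parameterization):

```python
def nb_moments(mu, theta):
    """Mean and variance of a negative binomial with mean mu
    and dispersion theta: E[X] = mu, Var[X] = mu + mu**2 / theta."""
    return mu, mu + mu**2 / theta

k = 10.0
mu_s, theta_s = 2.0, 3.0  # hypothetical predictions in the scaled space

# If X is the prediction on the scaled series, the unscaled series
# corresponds to k * X, whose moments are:
m_s, v_s = nb_moments(mu_s, theta_s)
target_mean, target_var = k * m_s, k**2 * v_s

# DeepAR-style rescaling: mu -> k * mu, theta -> sqrt(k) * theta.
m_d, v_d = nb_moments(k * mu_s, k**0.5 * theta_s)

print(m_d, target_mean)  # the means agree
print(v_d, target_var)   # the variances do not
```

The rescaled mean matches the mean of $kX$, but the variance does not, so I do not see what property the $\sqrt{k}$ rule is meant to preserve.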
Moreover, after reading this question, I wonder whether an exact solution even exists: $kX$ takes values in $\{0, k, 2k, \dots\}$ rather than the non-negative integers, so it cannot itself be negative binomial.
I would appreciate any help.