Computing random numbers from an arbitrary (non-Gaussian) multivariate CDF


As part of a test harness I need to create random pairs of numbers according to a specified 2D CDF (Cumulative Distribution Function). The CDFs change over time and are the result of some opaque, ML (Machine Learning)-like process. The CDFs are, however, numerically well behaved: they can be integrated efficiently, and although differentiation has accuracy issues, these are not insurmountable. The real application may have irregular domains, but I believe this part of the problem is separable.

So let's assume what seems to be the simplest form of the problem: generate pairs $(y_0,y_1)$ over the region $(y_0,y_1) \in [0,\ 1] \times [0,\ 1]$ such that the density is $P_y(y_0,\ y_1) = y_0 + y_1$. The key point is that $P_y$ is not factorable, so $y_0$ and $y_1$ can't be represented as independent random variables.

So it seemed that a simple yet generalizable approach was to start by defining the CDF in the standard way $$C_y(y_0,y_1) = \int_0^{y_0} \int_0^{y_1} P_y(u,v)\ du\ dv$$ and generate independent uniform random variables $x_0$ and $x_1$ on the interval $[0,1]$. The first major step is solving for $Y_0$ such that $C_y(Y_0,1) = x_0$; this is equivalent to solving for $Y_0$ such that, for the random variable $y_0$, $\Pr(y_0 < Y_0) = x_0$. The second major step is using this value to solve for $Y_1$ such that $$\frac{C_y(Y_0,y_1)}{C_y(Y_0,1)} = x_1$$ which is equivalent to finding $Y_1$ such that $\Pr(y_1 < Y_1 \mid y_0 < Y_0) = x_1$.
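For concreteness, when no closed forms are available, the two steps can be carried out numerically. Below is a minimal sketch using bisection; the helper names `bisect` and `samplePair` are my own, and the fixed iteration count is an arbitrary choice. Note that this implements the procedure exactly as stated, which, as discussed later in the question, turns out not to reproduce the target distribution.

```kotlin
import kotlin.random.Random

// Bisection solve on [0, 1] for a monotone increasing f with f(t) = target.
// `bisect` is a hypothetical helper; 60 halvings is an arbitrary choice.
fun bisect(target: Double, f: (Double) -> Double): Double {
    var lo = 0.0
    var hi = 1.0
    repeat(60) {
        val mid = 0.5 * (lo + hi)
        if (f(mid) < target) lo = mid else hi = mid
    }
    return 0.5 * (lo + hi)
}

// The question's two-step procedure for an arbitrary 2D CDF C on [0,1]^2:
// step 1 solves C(y0, 1) = x0; step 2 solves C(y0, y1) / C(y0, 1) = x1.
fun samplePair(C: (Double, Double) -> Double): Pair<Double, Double> {
    val x0 = Random.nextDouble()
    val x1 = Random.nextDouble()
    val y0 = bisect(x0) { t -> C(t, 1.0) }
    val y1 = bisect(x1) { t -> C(y0, t) / C(y0, 1.0) }
    return Pair(y0, y1)
}
```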

This result fools the eye (it looks good), but it doesn't check numerically, which caused a subtle bug in the larger test system. I assumed a numerical bug, which I was promptly unable to find. Then I discovered that it doesn't check symbolically either.

For the assumed PDF (Probability Density Function), the CDF is $C_y = \tfrac{1}{2}\left(y_0^2\ y_1 + y_0\ y_1^2\right)$. So $$Y_0 = \frac{\sqrt{1+8\ x_0}-1}{2} $$ and $$Y_1 = \frac{\sqrt{4x_1\ Y_0 + Y_0^2 + 4x_1} - Y_0}{2} $$
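These closed forms translate directly into code. A sketch (the function name `flawedPair` is my own, flagging that, as shown below, the resulting distribution is not exact):

```kotlin
import kotlin.math.sqrt
import kotlin.random.Random

// Direct implementation of the closed forms above. The name `flawedPair`
// is a hypothetical label for the question's procedure: the joint density
// it produces is close to, but not exactly, the target y0 + y1.
fun flawedPair(): Pair<Double, Double> {
    val x0 = Random.nextDouble()
    val x1 = Random.nextDouble()
    val y0 = (sqrt(1.0 + 8.0 * x0) - 1.0) / 2.0
    val y1 = (sqrt(4.0 * x1 * y0 + y0 * y0 + 4.0 * x1) - y0) / 2.0
    return Pair(y0, y1)
}
```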

Fully expanding $Y_1$ yields a big mess $$ Y_1 = \frac{1}{4} - \frac{\sqrt {1+8x_0}}{4} + \frac{\sqrt {4x_1\, \left( -\frac{1}{2} + \frac{\sqrt {1+8x_0}}{2} \right) + \left( -\frac{1}{2} + \frac{\sqrt{1+8x_0}}{2} \right)^{2}+4x_1}}{2} $$

Fortunately, this simplifies considerably when solving for the inverse. Let $W(x_0,x_1) = (Y_0,Y_1)$; then $$ W^{-1}\left( Y_0,Y_1\right) = \left(\frac{Y_0^2 + Y_0}{2},\ \frac{Y_1\ (Y_0 + Y_1)}{Y_0 + 1}\right)$$
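As a spot check (my own, not part of the question), applying $W$ and then $W^{-1}$ at a sample point recovers $(x_0,x_1)$ to machine precision:

```kotlin
import kotlin.math.sqrt

// Round-trip check: apply W (the sampling map) and then W^{-1} at one
// point. The denominator in the second component of W^{-1} is y0 + 1.
fun roundTrip(x0: Double, x1: Double): Pair<Double, Double> {
    val y0 = (sqrt(1.0 + 8.0 * x0) - 1.0) / 2.0
    val y1 = (sqrt(4.0 * x1 * y0 + y0 * y0 + 4.0 * x1) - y0) / 2.0
    val x0Back = (y0 * y0 + y0) / 2.0
    val x1Back = y1 * (y0 + y1) / (y0 + 1.0)
    return Pair(x0Back, x1Back)
}
```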

Computing the Jacobian (not sure how much detail I should include here) yields $$ J = \frac{(2Y_0 + 1)(Y_0+2Y_1)}{2(Y_0 + 1)} $$ and because $P_x = 1$ for all $x_0$ and $x_1$, the Jacobian is the joint PDF of $(Y_0,Y_1)$ produced by this procedure. The resulting distribution turns out to be a somewhat decent approximation of the target (off by only 4% at the worst point), but it is not exact.
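One quick way to see the mismatch is to compare $J$ with the target density $y_0+y_1$ directly on a grid (my own sketch; the grid resolution is arbitrary):

```kotlin
// Evaluate the Jacobian J against the target density y0 + y1 on a grid
// and return the largest absolute gap between the two densities.
fun maxDensityGap(n: Int = 10): Double {
    var worst = 0.0
    for (i in 0..n) for (j in 0..n) {
        val y0 = i.toDouble() / n
        val y1 = j.toDouble() / n
        val jac = (2.0 * y0 + 1.0) * (y0 + 2.0 * y1) / (2.0 * (y0 + 1.0))
        val gap = kotlin.math.abs(jac - (y0 + y1))
        if (gap > worst) worst = gap
    }
    return worst
}
```

Comparing densities (rather than CDFs) this way, the largest absolute gap on this grid is 0.25, attained at the corners $(1,0)$ and $(1,1)$.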

I think the procedure is theoretically flawed, but: 1) I'm having trouble articulating the problem, and 2) I don't know what procedure I should use instead.

Best Answer

I'm used to writing $f$ for densities and $F$ for cumulative distribution functions (cdf); I hope that won't cause confusion. As you wrote, we try to get our sample from independent random variables, uniformly distributed in $[0,1]$. For one-dimensional distributions, this is simple: if $Y$ has the (continuous) cdf $F$, then $X=F(Y)$ will be uniformly distributed in $[0,1]$. The other way round, we can find a variable $Y$ with cdf $F$ as $Y=F^{-1}(X)$.
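As a concrete one-dimensional illustration (my own example, not from the answer): for the cdf $F(y)=y^2$ on $[0,1]$, which has density $2y$, the inverse is $F^{-1}(x)=\sqrt{x}$, so taking the square root of a uniform variate produces a sample with cdf $F$.

```kotlin
import kotlin.math.sqrt
import kotlin.random.Random

// One-dimensional example of Y = F^{-1}(X): for F(y) = y^2 on [0, 1]
// (density 2y), the inverse cdf is sqrt, so sqrt of a uniform variate
// has cdf F.
fun sampleFromSquareCdf(rng: Random = Random.Default): Double =
    sqrt(rng.nextDouble())
```

For this density $\mathbb{E}Y = \int_0^1 y\cdot 2y\,dy = 2/3$, which a quick Monte Carlo average confirms.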

Now, let's assume we have a two-dimensional distribution with joint density $f_{Y_0, Y_1}(y_0,y_1)$. Then we can write it as a product of conditional densities, $$f_{Y_0, Y_1}(y_0,y_1)=f_{Y_0}(y_0)\,f_{Y_1}(y_1|y_0),\tag{1}$$ where $f_{Y_0}$ is the density of the marginal distribution, $$f_{Y_0}(y_0)=\int^\infty_{-\infty}f_{Y_0, Y_1}(y_0,y_1)\,dy_1.\tag{2}$$

For your simple model, we only have to integrate from $0$ to $1$, since the density is $0$ outside $[0,1]$; from $f_{Y_0, Y_1}(y_0,y_1)=y_0+y_1$ for $(y_0,y_1)\in[0,1]^2$, we obtain $$f_{Y_0}(y_0)=y_0+1/2.$$ The corresponding cdf is $$F_{Y_0}(y_0)=(y^2_0+y_0)/2,$$ so our first equation is $$X_0=(Y^2_0+Y_0)/2.\tag{3}$$ It's elementary to solve for $Y_0$: $$Y_0=\sqrt{2\,X_0+1/4}-1/2.\tag{4}$$

From (1), we obtain the conditional density $$f_{Y_1}(y_1|y_0)=\frac{y_0+y_1}{y_0+1/2},$$ giving the conditional cdf $$F_{Y_1}(y_1|y_0)=\frac{y_0y_1+y^2_1/2}{y_0+1/2}$$ for $y_1\in[0,1]$. So our second equation is $$X_1=\frac{Y_0Y_1+Y^2_1/2}{Y_0+1/2},\tag{5}$$ which can be solved elementarily for $Y_1$: $$Y_1=\sqrt{Y^2_0+X_1(2\,Y_0+1)}-Y_0.\tag{6}$$

Looking at (4) and (6), it's not obvious that $Y_0$ and $Y_1$ have the same distribution, but it's true. The Jacobian we get from (3) and (5) is exactly $Y_0+Y_1$; that's just equation (1).

I've checked the results numerically: from your density, we conclude $\mathbb{E}Y^2_0=\mathbb{E}Y^2_1=5/12$ and $\mathbb{E}Y_0Y_1=1/3$. Generating a sample of $10^8$ pairs $(y_0,y_1)$ is straightforward with code like

import kotlin.math.sqrt
import kotlin.random.Random

fun pair(): Pair<Double, Double> {
    val x0 = Random.nextDouble()
    val x1 = Random.nextDouble()
    val y0 = sqrt(2.0 * x0 + 0.25) - 0.5                  // equation (4)
    val y1 = sqrt(y0 * y0 + x1 * (2.0 * y0 + 1.0)) - y0   // equation (6)
    return Pair(y0, y1)
}

That gives estimates for the moments mentioned above like:

E[Y0^2] ≈ 0.4166417030452688
E[Y1^2] ≈ 0.4166630593865879
E[Y0*Y1] ≈ 0.3333395345332752
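For reproducibility, the same check can be sketched with a seeded generator and a smaller sample (the seed and the count of $10^6$, instead of the answer's $10^8$, are my own choices):

```kotlin
import kotlin.math.sqrt
import kotlin.random.Random

// Seeded Monte Carlo estimates of E[Y0^2], E[Y1^2], and E[Y0*Y1] using
// the sampler from the answer; expected values are 5/12, 5/12, and 1/3.
fun estimateMoments(n: Int = 1_000_000, rng: Random = Random(42)): Triple<Double, Double, Double> {
    var m00 = 0.0
    var m11 = 0.0
    var m01 = 0.0
    repeat(n) {
        val x0 = rng.nextDouble()
        val x1 = rng.nextDouble()
        val y0 = sqrt(2.0 * x0 + 0.25) - 0.5
        val y1 = sqrt(y0 * y0 + x1 * (2.0 * y0 + 1.0)) - y0
        m00 += y0 * y0
        m11 += y1 * y1
        m01 += y0 * y1
    }
    return Triple(m00 / n, m11 / n, m01 / n)
}
```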

The accuracy was to be expected. In your real application, the cdfs will be more complicated, and inverting them may be a computational/numerical challenge, but it should be solvable.