Can we split a random variable into intervals on its domain of possible values and express it in terms of "simple" distributions on those intervals?

So suppose we have a random variable $Z$ which can take values in $\left[-A, A \right]$. Suppose we do not know the exact distribution of $Z$.

Now take $N$ fixed disjoint intervals $Y_1, \dots, Y_N$ of $[-A,A]$ such that $\mathop{\dot{\bigcup}}_{i=1}^N Y_i=[-A,A]$, and suppose we know the probabilities $p_i=P(Z \in Y_i)$.

Now if we wish to sample $Z$, we can first sample from a categorical distribution to determine which interval $Y_i$ the sample of $Z$ will lie in. From there we have to sample from the distribution of $Z \mid Z \in Y_i$.

If we know the distribution of $Z|Z \in Y_i$ we can easily sample $Z$ in this indirect manner. Now for my question:

Suppose we do not know the distribution of $Z \mid Z \in Y_i$ and approximate it with a uniform distribution on $Y_i$. How good is this approximation? Can we get any convergence results if we let $N \rightarrow \infty$, assuming we can still obtain the $P(Z \in Y_i)$?
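For concreteness, here is a minimal sketch of this two-stage sampling scheme with the uniform stand-in for the conditional distribution (the interval layout and probabilities below are hypothetical, just to illustrate):

```python
import random

def sample_piecewise_uniform(edges, probs, rng=random):
    """Two-stage sample: draw an interval index i with probability p_i
    (the categorical step), then draw uniformly inside
    Y_i = [edges[i], edges[i+1]] (the uniform approximation of
    Z | Z in Y_i)."""
    i = rng.choices(range(len(probs)), weights=probs)[0]
    return rng.uniform(edges[i], edges[i + 1])

# Hypothetical setup on [-1, 1] with two intervals:
edges = [-1.0, 0.0, 1.0]
probs = [0.25, 0.75]   # p_1 = P(Z in [-1, 0)), p_2 = P(Z in [0, 1])
draws = [sample_piecewise_uniform(edges, probs) for _ in range(10000)]
```

By construction, the fraction of draws landing in each $Y_i$ matches $p_i$ (up to sampling noise); only the shape of the distribution *within* each interval is approximated.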

Now based on the comments, I see that this is equivalent to asking the following: if we know the values $F_Z(x_i)$ of the cumulative distribution function, where the $x_i$ are the edges of the intervals $Y_i$, how well does linear interpolation between these values approximate $F_Z$? Since the cumulative distribution function is monotonically increasing, I think we should be able to achieve some sort of result on how well the linear interpolation approximates it. I hope this clarifies my question.
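For intuition, here is a small numerical check of how the linear interpolation error behaves as $N$ grows, using a hypothetical smooth CDF $F(x)=(x+1)^2/4$ on $[-1,1]$ (this is an illustration, not part of the question's setup):

```python
import numpy as np

def interp_cdf_error(F, A, N, grid=4001):
    """Sup-norm gap between a CDF F on [-A, A] and its linear
    interpolation through the N+1 equally spaced interval edges."""
    edges = np.linspace(-A, A, N + 1)
    xs = np.linspace(-A, A, grid)
    Fhat = np.interp(xs, edges, F(edges))   # piecewise-linear CDF
    return float(np.max(np.abs(Fhat - F(xs))))

# Hypothetical smooth CDF on [-1, 1]: F(x) = (x + 1)**2 / 4.
F = lambda x: (x + 1) ** 2 / 4
errs = [interp_cdf_error(F, 1.0, N) for N in (4, 8, 16, 32)]
# For a smooth F the error decays like O(1/N^2); for a merely
# monotone F it is still bounded by max_i p_i on each interval.
```

Monotonicity alone gives the coarser bound: on each $Y_i$ both $F_Z$ and its interpolant lie between $F_Z(x_{i-1})$ and $F_Z(x_i)$, so the error is at most $\max_i p_i$.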

Furthermore, would it make a difference if $Z$ is unbounded?

Best answer:

Let the true (unknown) PDF and CDF of $Z$ be $f(z)$ and $F(z)$ respectively, and let $\hat F$ denote the CDF of the piecewise-uniform approximation. With $p_i=\Pr(Z\in Y_i)=\int_{Y_i}f(z)\,dz$ known, the approximation becomes exact as $N\to\infty$ iff $$\max_{i}|Y_i|\to 0.$$

To show sufficiency, we show that $|\hat F(z)-F(z)|$ can be made arbitrarily small by choosing $N$ large enough. For a given $N$ and $z$, let $Y_1,Y_2,\cdots,Y_m$ be the intervals contained entirely in the region $Z<z$, and let $Y'_{m+1}\subset Y_{m+1}$ be the part of the next interval that still lies in that region. Since the $Y_i$ are disjoint, $$F(z)=\Pr(Z\in Y_1\cup\cdots \cup Y_m\cup Y'_{m+1})=\sum_{i=1}^{m}p_i+\int_{Y'_{m+1}}f(z)\,dz,$$ while the uniform approximation on $Y_{m+1}$ gives $$\hat F(z)=\sum_{i=1}^{m}p_i+p_{m+1}\,\frac{|Y'_{m+1}|}{|Y_{m+1}|}.$$ Both remainder terms lie between $0$ and $$\int_{Y_{m+1}}f(z)\,dz=p_{m+1},$$ so $|\hat F(z)-F(z)|\le p_{m+1}$, and $p_{m+1}\to 0$ as $|Y_{m+1}|\to 0$ by absolute continuity of the integral. This completes the sufficiency proof.

For necessity, note that the $Y_i$ can be interpreted as "clouds". To increase our knowledge about the random variable, those clouds need to become smaller and tend to points; otherwise there always remains some ambiguity about the random variable in at least one cloud that stops shrinking, i.e. some $Y_i$ with $$|Y_i|>\epsilon\qquad,\qquad\text{for every value of }N.$$ Moreover, since we have assumed a uniform distribution on each cloud, that ambiguity is as large as possible (the uniform distribution maximizes entropy on a fixed interval; see information theory). Therefore the stated condition is necessary and sufficient, and the proof is complete.
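As a numerical sanity check of the sufficiency argument: the approximating CDF is exactly $F$ linearly interpolated at the interval edges, its sup-norm distance from $F$ is bounded by $\max_i p_i$, and both vanish as the intervals shrink. A sketch with a hypothetical smooth CDF $F(x)=(\sin(\pi x/2)+1)/2$ on $[-1,1]$:

```python
import numpy as np

# Hypothetical smooth CDF on [-1, 1]: increasing, F(-1) = 0, F(1) = 1.
F = lambda x: (np.sin(np.pi * x / 2) + 1) / 2

def approx_cdf_gap(F, A, N):
    """Return (sup |F_hat - F|, max_i p_i) for the piecewise-uniform
    approximation on N equal intervals of [-A, A]; F_hat is just F
    linearly interpolated at the interval edges."""
    edges = np.linspace(-A, A, N + 1)
    xs = np.linspace(-A, A, 4001)
    Fhat = np.interp(xs, edges, F(edges))
    return (float(np.max(np.abs(Fhat - F(xs)))),
            float(np.max(np.diff(F(edges)))))

gaps = [approx_cdf_gap(F, 1.0, N) for N in (2, 8, 32, 128)]
# Each sup-norm gap stays below the corresponding max_i p_i,
# and both tend to 0 as max_i |Y_i| = 2/N -> 0.
```

Nothing here depends on the specific $F$ beyond smoothness; the bound $\sup_z|\hat F(z)-F(z)|\le\max_i p_i$ holds for any monotone CDF.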