Generate samples from other samples

Given a family of samples $(x_i)_{i \in I}$ drawn from some unknown continuous probability distribution, how can I generate more samples that follow the same unknown distribution?

Assumption: I first have to estimate the unknown distribution and then generate samples from this distribution.

If my assumption is valid, how do I do this estimation? Otherwise, is there a direct approach to generate samples from other samples? What are the drawbacks?


There are 2 answers below.


Are you familiar with kernel density estimation? Refer to the book Density Estimation for Statistics and Data Analysis by Bernard W. Silverman.
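As a minimal illustration of the idea (a sketch, not taken from Silverman's book), R's built-in `density` function computes a Gaussian kernel density estimate; the sample `x` below is hypothetical, standing in for your own data:

```r
# Kernel density estimate with R's built-in density() function.
# x is a hypothetical sample; substitute your own observations.
set.seed(1)
x = rnorm(200)
kde = density(x)   # Gaussian kernel, bandwidth chosen automatically
plot(kde)          # the estimated density curve
```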


Here are three related topics. Suppose we are given the following vector x of $n = 100$ observations, rounded to two places, with sample mean $\bar X = 2.0798$ and sample variance $S^2 = 1.13149.$

0.80 2.06 2.10 1.13 2.26 1.25 2.79 1.22 1.33 0.50 1.81 2.76 1.65 1.77 2.47 1.20 1.88 1.32 1.75 1.71
0.66 1.76 1.76 2.44 2.85 6.66 3.61 0.64 3.37 0.89 3.14 1.32 0.96 0.86 1.29 2.41 1.91 1.06 0.93 3.81
3.57 0.83 3.28 3.53 1.26 1.74 2.09 1.57 3.21 2.08 1.76 1.35 2.07 1.91 1.94 1.12 1.14 5.72 1.88 1.93
1.11 1.59 1.35 2.30 1.39 1.90 1.30 2.00 3.71 2.81 1.32 4.77 3.04 2.72 2.38 2.10 1.82 2.50 2.38 1.53
1.91 1.01 3.42 2.28 2.86 1.73 2.63 1.77 2.61 0.78 1.08 1.73 1.16 4.20 2.52 3.87 1.83 1.21 1.90 3.36

You might find these notes by C. Wickham of interest.

Parametric bootstrap. This is a re-sampling procedure for finding a confidence interval for a population parameter. Here, suppose the parameter is the mean $\mu$ of the population from which our observations were randomly sampled.

In order to carry out this procedure, we need to know the distribution family of the population from which the data were generated. Suppose it is $Gamma(\text{shape}=\theta, \text{rate}=\lambda),$ where $\mu = \theta/\lambda.$ For our sample, the method of moments estimates are $\hat \theta = \bar X^2/S^2 = 3.822927,\,$ $\hat \lambda = \bar X/S^2 = 1.838114,$ and $\hat \mu = \bar X$ denoted obs.mean below.

In order to find a confidence interval for $\mu,$ we would like to know the distribution of the random variable $D = \bar X - \mu.$ If we knew its distribution, we could find values $L < 0$ and $U > 0$ cutting off the lower and upper 2.5% of the probability from that distribution to obtain $$P(L < D = \bar X - \mu < U) = 0.95,$$ and hence the 95% confidence interval $(\bar X - U, \bar X - L)$ for $\mu.$

However, not knowing these values $L$ and $U,$ we enter the 'bootstrap world', in which we take a large number $B$ of re-samples from $Gamma(\hat \theta, \hat \lambda),$ finding the bootstrap mean $\bar X^*$ of each re-sample. We emulate the distribution of $D$ with the many values $D^* = \bar X^* - \bar X,$ where in the bootstrap world $\bar X$ is a temporary proxy for $\mu.$

Returning to the real world, we approximate $L$ by $L^*$ and $U$ by $U^*,$ where $L^*$ and $U^*$ cut 2.5% from the lower and upper tails of the $B$ re-sampled values of $D^*.$ Then our 95% parametric bootstrap CI for $\mu$ is $(\bar X - U^*, \bar X - L^*),$ in which $\bar X$ has returned to its original role as the mean of the original sample. The resulting CI is $(1.87, 2.29).$ The R program to obtain this CI is shown below, in which the suffix .re replaces the superscripted *'s above.

obs.mean = mean(x)
B = 10000;  d.re=numeric(B)
for(i in 1:B) {
  x.re = rgamma(100, tht.mme, lam.mme)  # shape and rate estimates defined below
  d.re[i] = mean(x.re) - obs.mean }
obs.mean - quantile(d.re, c(.975,.025))
##     97.5%     2.5% 
##  1.865877 2.285135 

Nonparametric bootstrap. As your problem is stated, you give no clue as to the distribution family of the population. In that case, the procedure is much the same. However, we cannot re-sample from a distribution with parameters estimated from the data. Instead, we take re-samples of size $n = 100$ with replacement from the original data x.

The rationale is that the data provide an empirical CDF of the unknown population distribution. All that is known about the population is that it is capable of producing the 100 observations we have at hand. The resulting nonparametric bootstrap CI is $(1.86, 2.28)$.

obs.mean = mean(x)
B = 10000;  d.re=numeric(B)
for(i in 1:B) {
  x.re = sample(x, 100, replace=TRUE)
  d.re[i] = mean(x.re) - obs.mean }
obs.mean - quantile(d.re, c(.975,.025))
##    97.5%     2.5% 
## 1.861853 2.282402 

The truth. The data were simulated from $Gamma(4, 2)$ with mean $\mu = 2.$ So in this case we have the nice advantage of knowing that our CIs actually do contain the true value of $\mu.$ In R, you can simulate exactly the sample shown by using the same random number generator seed that I did. Also, $\hat \theta = 3.82$ and $\hat \lambda = 1.84.$ Maximum likelihood estimators are preferable, but messier, and this is too long already. Moreover, the sample size $n = 100$ is large enough that it would be safe to find a t CI for $\mu,$ approximately $(1.87, 2.29).$

set.seed(1234)
x = rgamma(100, 4, 2);  round(x, 2)
a = mean(x);  v = var(x)

lam.mme = a/v;  tht.mme = a^2/v
lam.mme;  tht.mme
## 1.838114
## 3.822927

pm = c(-1,1);  t.cut = qt(.975, 99)
mean(x) + pm*t.cut*sd(x)/sqrt(100)
## approximately 1.87 2.29

It is important to understand that bootstrap re-sampling is a method of data analysis. No new information is 'generated' by re-sampling. Re-sampling is just a clever way to use the information in the original sample.

Density estimator. Density estimators were mentioned in the Comment by @AbishankaSaha. The figure below shows a histogram of the data. The blue curve is the PDF of $Gamma(4, 2)$ and the red curve is the default kernel density estimator implemented in R. It might be possible to sample from the density-estimated distribution, but there is no more information in the density estimator than in the sample.

hist(x, prob=T, col="skyblue2")
curve(dgamma(x, 4, 2), lwd=2, col="blue", add=T)
lines(density(x), lwd=2, col="red")
rug(x)
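To make the last point concrete, here is one way to draw new values from the density-estimated distribution, sometimes called a smoothed bootstrap: resample the data with replacement and add Gaussian noise whose standard deviation is the kernel bandwidth chosen by `density`. This is a sketch, not part of the original answer; it regenerates the same sample x from the seed shown above.

```r
# Smoothed bootstrap: sample from the Gaussian kernel density estimate of x.
# Drawing from the KDE is equivalent to resampling the data with replacement
# and adding normal noise with sd equal to the kernel bandwidth.
set.seed(1234)
x = rgamma(100, 4, 2)        # the sample from the answer above

bw = density(x)$bw           # bandwidth chosen by density()
n.new = 100
x.new = sample(x, n.new, replace=TRUE) + rnorm(n.new, 0, bw)
summary(x.new)
```

The new values follow the KDE rather than the true $Gamma(4, 2)$ population, so, as noted above, they contain no information beyond what is already in the original sample.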

[Figure: histogram of x with the $Gamma(4, 2)$ density in blue and the kernel density estimate in red]