There is a method in qualitative research called "balanced snowball sampling", which runs as follows. Suppose the interviews are assessing perspectives on some issue; each interviewee is then asked to recommend two further interviewees: one who will offer a more positive perspective than their own, and one who will offer a more negative perspective.
In a simple model of this, we are sampling from some continuous distribution $p(x)$. However, we are not allowed to sample $p(x)$ directly: instead we start from an arbitrary initial sample of a single point (not necessarily drawn from $p(x)$, but necessarily within its support), and the only allowed mechanism for growing the sample is to take some point $x_i$ in our sample and generate two additional points $x_j^>$ and $x_j^<$, drawn randomly from $p(x|x>x_i)$ and $p(x|x<x_i)$ respectively.
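One step of this mechanism can be sketched in code. This is a minimal Python illustration (my own, not from the question): the exponential target and the function names are arbitrary choices, and the truncated draws use the inverse-CDF trick.

```python
import math
import random

random.seed(0)

# Illustrative target distribution: Exp(rate = 2).
# Any continuous p(x) with a known CDF and inverse CDF would do.
rate = 2.0
cdf = lambda x: 1.0 - math.exp(-rate * x)
inv_cdf = lambda u: -math.log(1.0 - u) / rate

def split(x_i):
    """Generate the two allowed new points from x_i:
    one from p(x | x < x_i) and one from p(x | x > x_i),
    by drawing a uniform restricted to the matching CDF interval."""
    f = cdf(x_i)
    x_below = inv_cdf(random.uniform(0.0, f))  # from p(x | x < x_i)
    x_above = inv_cdf(random.uniform(f, 1.0))  # from p(x | x > x_i)
    return x_below, x_above

lo, hi = split(0.5)
assert lo < 0.5 < hi  # truncation guarantees the ordering
```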
Interestingly, if you apply this method recursively (generating two points from every point in your sample, then two points from each of those points, and so on), the sample distribution does converge, but not to $p(x)$: instead it over-represents tail events.
I have two questions:

1. What distribution does the sample converge to?
2. Is there a straightforward way of including a rejection step (discarding certain points from the sample) so that the sample does converge to $p(x)$?
I agree with you that the answer to your first question, when starting from a uniform distribution on $[0,1]$, is an arcsine distribution, i.e. a beta distribution with parameters $\frac12,\frac12$, with density $\frac1{\pi \sqrt{x(1-x)}}$, cumulative distribution function $\frac2\pi \arcsin(\sqrt{x})$, and quantile function $\sin^2\left(\frac\pi 2 p\right)$. It is not difficult to prove this distribution is stable, in the sense that if the parent is drawn from it then a child picked uniformly at random from its two descendants is again drawn from it (the densities of the lower and upper children average to the arcsine density), though it is slightly harder to show that it must therefore be the limiting distribution.
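A sketch of one stationarity check (my notation; writing $f(t)=\frac1{\pi\sqrt{t(1-t)}}$ for the arcsine density and substituting $t=\sin^2\theta$ in the integrals): given an arcsine-distributed parent, the lower and upper children have densities

$$g_<(t) = \int_t^1 \frac{f(x)}{x}\,dx = \frac{2}{\pi}\sqrt{\frac{1-t}{t}}, \qquad g_>(t) = \int_0^t \frac{f(x)}{1-x}\,dx = \frac{2}{\pi}\sqrt{\frac{t}{1-t}},$$

and their average is

$$\frac{g_<(t)+g_>(t)}{2} = \frac{1}{\pi}\cdot\frac{(1-t)+t}{\sqrt{t(1-t)}} = \frac{1}{\pi\sqrt{t(1-t)}} = f(t),$$

so a randomly chosen child of an arcsine parent is again arcsine-distributed, which is exactly what is needed for the pooled sample distribution to be stationary.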
I think that suggests an approach to your second question in general: rank the responses in your full sample, $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, and give them weights based on their position relative to the arcsine distribution, such as $w_{(i)} =\sin\left(\frac{i-\frac12}{n} \pi\right)$.
Then you can either use all the sample observations with these weights, or accept each $x_{(i)}$ with probability $w_{(i)}$ (rejecting it with probability $1-w_{(i)}$). This should get you close to a sample from the original distribution.
Although the weights come from an arcsine distribution, the values you apply them to can come from any continuous distribution of a random variable $X$, as the generation of new values depends only on the distribution of $F(X)$, which is uniform on $[0,1]$.
As an illustration, here is an arbitrary simulation from an exponential distribution with rate $2$ (any other continuous distribution would do, changing the first two lines of the code):
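The original code is not reproduced here; a self-contained Python sketch of such a simulation (function names and the depth are my choices) might look like this:

```python
import math
import random

random.seed(1)

# Target: Exp(rate = 2). Changing these two lines is all that is needed
# for any other continuous distribution with a tractable CDF.
cdf = lambda x: 1.0 - math.exp(-2.0 * x)
inv_cdf = lambda u: -math.log(1.0 - min(u, 1.0 - 1e-15)) / 2.0  # guarded against u == 1

def grow(x, depth):
    """Recursive 'snowball': every point spawns one point below and one
    above it, each drawn from the target truncated at that point."""
    if depth == 0:
        return [x]
    f = cdf(x)
    below = inv_cdf(random.uniform(0.0, f))  # from p(x | x < point)
    above = inv_cdf(random.uniform(f, 1.0))  # from p(x | x > point)
    return [x] + grow(below, depth - 1) + grow(above, depth - 1)

sample = sorted(grow(inv_cdf(random.random()), 12))  # 2**13 - 1 points
n = len(sample)

# Compare the empirical CDF of the full sample with the target CDF:
# the discrepancy is large, with both tails over-represented.
ks = max(abs((i + 1) / n - cdf(x)) for i, x in enumerate(sample))
frac_low_tail = sum(cdf(x) < 0.1 for x in sample) / n
print(f"n = {n}, sup|ECDF - CDF| = {ks:.3f}, "
      f"share below the true 10th percentile = {frac_low_tail:.3f}")
```

If the full sample followed the target, about $10\%$ of it would lie below the true 10th percentile; under the arcsine limit that share is roughly $\frac2\pi\arcsin\sqrt{0.1}\approx 20\%$.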
As you say, the full sample (black ECDF below) does not match the underlying distribution (pink) and it oversamples in the tails.
Select from that, weighting towards the centre:
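Again the original code is not reproduced here; a self-contained Python sketch of this thinning step might look like the following, using direct arcsine draws in $F$-space as a stand-in for the full snowball sample (the rate-2 exponential and the sample size are arbitrary choices):

```python
import math
import random

random.seed(2)

# Stand-in for the full snowball sample: draws whose F-values follow the
# arcsine law (its quantile function is sin^2(pi/2 * p)), mapped through
# the Exp(2) quantile function.
inv_cdf = lambda u: -math.log(1.0 - u) / 2.0  # Exp(2) quantile function
n = 8191
sample = sorted(inv_cdf(math.sin(0.5 * math.pi * random.random()) ** 2)
                for _ in range(n))

# Keep the i-th order statistic with probability sin(pi * (i - 1/2) / n):
# weight near 0 in the over-sampled tails, weight near 1 in the middle.
kept = [x for i, x in enumerate(sample, start=1)
        if random.random() < math.sin(math.pi * (i - 0.5) / n)]

# The retained points should now be close to the Exp(2) target,
# and roughly 2/pi (about 64%) of the sample survives.
cdf = lambda x: 1.0 - math.exp(-2.0 * x)
m = len(kept)
ks = max(abs((i + 1) / m - cdf(x)) for i, x in enumerate(kept))
print(f"kept {m} of {n} ({m / n:.1%}), sup|ECDF - CDF| = {ks:.3f}")
```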
This seems to give a much better fit, with the pink and black curves close to identical. It uses almost $64\%$ of the original full-sample values (the expected acceptance rate is $\frac2\pi \approx 63.7\%$). Essentially the same thing would happen with any other continuous distribution, and you can do the rejection sampling without needing to know what that distribution is (even though you do need it for simulation purposes).