References for a statistics question relating to clustering

30 Views Asked by At

I am interested in references for the following research topic. It was mentioned to me that this may be a classically studied question, but I'm unsure what line of work of references to begin looking into, and would much appreciate some suggestions!

Suppose you have two “nice” regions $A$ and $B$ in $R^d$ separated by some distance delta, how many random samples do you need to see from the uniform distribution over $A \cup B$ so that if you connect all pairs at distance $< \delta$, you will get good approximations to $A$ and $B$. Here, the notion of "good approximation" to $A$ and $B$ is that the connected components $A'$ and $B'$ should separate points from $A$ and $B$ to good accuracy. It seems related to this work on Consistent procedures for cluster tree estimation and pruning (Chaudhuri, Dasgupta, Kptofe, and Luxberg) https://arxiv.org/pdf/1406.1546.pdf, but is not quite the same due to their more "robust" notion of connecting points together.

I am also curious about the following balls and bins generalization. Suppose that instead of considering the problem where we have a set of n bins and want to know the number of balls to fill each bin, we instead consider the case where we have n bins grouped into k buckets of size $n/k$ each. I want to know how many balls must be thrown in order to ensure that all bins in some bucket are observed.

Formally, let $S_t$ denote the set of bins which contain a ball after throwing $t$ balls in our experiment. Let $[n] = \cup_{i \in [k]} B_i$ form a partition of the $[n]$ bins into $k$ equally sized buckets $|B_i| = n/k.$ The goal is to obtain a high probability bound on $T$ where $T$ is the number of balls required to ensure that $B_i \cap S_T = B_i$ for some $B_i$.