How to do multivariate stratified sampling


How can I implement stratified sampling when each sample holds multivariate information? I am also not sure what this problem is called or where exactly to look for a solution, so any pointers would be useful.

Below I try to state my problem as precisely as I can.

Problem, with an example:

Suppose we have $n$ objects denoted $o_1, \dots, o_n$. Each object has $m$ different attributes, which are simple counts. For example, say $n=100$ and $m=3$; object 1 of the 100, denoted $o_1$, has $m_1=3$, $m_2=5$ and $m_3=1$.

How can we split the objects into $s$ different groups such that the groups have similar attribute counts? That is, if we had two groups $g_1$ and $g_2$, then $\sum_{j=1}^{3} m_j^{g_1} \approx \sum_{j=1}^{3} m_j^{g_2}$, where $m_j^{g}$ denotes the count of attribute $j$ in group $g$.

There are 1 best solutions below


One of the first steps in stratified sampling is to define the strata. Strata should be defined so that the elements within a stratum are as homogeneous as possible, while elements of different strata are heterogeneous. One way to obtain such strata is to use machine learning techniques. One possibility is linear discriminant analysis (there are many other choices, but this one is quite easy to implement). If you want strata in which the variables have similar sums, you can add the sum as an auxiliary variable in the linear discriminant analysis model.

Once the strata are chosen, the second step is to allocate the total sample size among the strata. When $m=1$, one popular (and in some sense optimal) solution is the Neyman allocation, which minimizes the variance $V(\hat{Y})$ subject to the constraint that the stratum sample sizes add up to the total sample size, $\sum_{h}n_h=n$.

$$\text{Neyman allocation: } n_h=n \cdot \dfrac{S_h N_h}{\sum_h S_h N_h},$$ where $S_h$ is the square root of the population quasi-variance in stratum $h$ and $N_h$ is the total number of elements in stratum $h$. In the multidimensional case $m>1$, one usually minimizes a weighted sum of variances, $\min \sum_{i=1}^{m} \alpha_i \cdot V(\hat{Y}_i)$ subject to $\sum_{h}n_h=n$, where $\sum_i \alpha_i=1$.
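
As a rough illustration of the allocation step, here is a minimal Python sketch. The function names `neyman_allocation` and `weighted_allocation` and all numbers are made up for the example. The multivariate function assumes stratified simple random sampling, where $V(\hat{Y}_i)$ depends on $n_h$ only through $\sum_h N_h^2 S_{ih}^2 / n_h$; under that assumption the weighted problem is solved by $n_h \propto N_h \sqrt{\sum_i \alpha_i S_{ih}^2}$, which reduces to the Neyman allocation when $m=1$. The results are fractional and still need rounding to integers.

```python
import math


def neyman_allocation(n, N_h, S_h):
    """Split the total sample size n across strata proportionally to N_h * S_h.

    N_h: stratum population sizes; S_h: stratum standard deviations.
    Returns fractional sample sizes that sum to n.
    """
    weights = [N * S for N, S in zip(N_h, S_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


def weighted_allocation(n, N_h, S_ih, alpha):
    """Multivariate allocation sketch (assumes stratified simple random sampling).

    S_ih[i][h] is the standard deviation of variable i in stratum h and
    alpha[i] is the weight of variable i (weights sum to 1). The stratum
    deviations are pooled as sqrt(sum_i alpha_i * S_ih^2) and then passed
    to the univariate Neyman formula.
    """
    pooled = [
        math.sqrt(sum(a * S_i[h] ** 2 for a, S_i in zip(alpha, S_ih)))
        for h in range(len(N_h))
    ]
    return neyman_allocation(n, N_h, pooled)


# Hypothetical example: three strata, total sample size 100.
N_h = [500, 300, 200]
univariate = neyman_allocation(100, N_h, [2.0, 4.0, 1.0])

# Two variables with equal weights.
S_ih = [[2.0, 4.0, 1.0], [1.0, 1.0, 3.0]]
multivariate = weighted_allocation(100, N_h, S_ih, [0.5, 0.5])
```

In the univariate example the weights $N_h S_h$ are 1000, 1200 and 200, so the second stratum receives half of the sample even though it is not the largest, because it is the most variable.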

As for estimation, if your target quantity is linear in the variables, the Horvitz–Thompson estimator is a great choice (it is not the only one, and it can be improved if you have auxiliary information).
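
For concreteness, a minimal sketch of the Horvitz–Thompson estimator of a population total, $\hat{Y} = \sum_{k \in \text{sample}} y_k / \pi_k$, where $\pi_k$ is the inclusion probability of unit $k$. The sketch assumes simple random sampling without replacement within each stratum, so that $\pi_k = n_h / N_h$ for a unit in stratum $h$. The function names and the data are invented for the example.

```python
def stratified_inclusion_probs(sample_strata, n_h, N_h):
    """Inclusion probability n_h / N_h for each sampled unit.

    Assumes simple random sampling without replacement inside each stratum.
    sample_strata[k] is the stratum index of sampled unit k.
    """
    return [n_h[h] / N_h[h] for h in sample_strata]


def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total as the sum of y_k / pi_k over the sample."""
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    return sum(y / pi for y, pi in zip(values, inclusion_probs))


# Hypothetical sample: 2 units from a stratum of 10, 4 units from a stratum of 20.
N_h = [10, 20]
n_h = [2, 4]
sample_strata = [0, 0, 1, 1, 1, 1]
y = [3, 5, 1, 2, 1, 2]

probs = stratified_inclusion_probs(sample_strata, n_h, N_h)
estimated_total = horvitz_thompson_total(y, probs)
```

Here every sampled unit is weighted by $N_h / n_h$, so the estimate is $(3+5)\cdot 10/2 + (1+2+1+2)\cdot 20/4 = 70$, up to floating-point rounding. With several count attributes per object, the same function can be applied to each attribute separately.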