Biased sample from biased sample


A webpage has users, and each user has a number of projects uniquely assigned to him or her. I want a random sample of users, obtained by randomly sampling projects and taking the user assigned to each sampled project into the sample. I am not interested in users without any projects. This approach will, however, create a biased sample of users, since users with more projects are more likely to be selected. Can I obtain an unbiased sample by selecting users from the biased sample into the unbiased sample with probability $$p_i = \frac{\sum_j c_j }{ c_i \sum_k (\frac{1}{c_k} \sum_j c_j )},$$ where $c_i$ is the number of projects of user $i$? Here, $\sum_j c_j$ is the sum of all projects over the users in the biased sample. The formula is meant to yield a probability that cancels out the original bias, at least for those chosen in the biased sample.

On BEST ANSWER

Yes, you can do this with importance sampling if you just want to correct your estimate of a population statistic (e.g., mean, standard deviation). However, if you want an actual random sample, then here's one possible approach:
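As a quick sketch of the importance-sampling route (my toy code, not part of the answer): if users were drawn with probability proportional to their project count $c_i$, weighting each drawn user by $1/c_i$ in a self-normalized estimator cancels that bias when estimating, say, a population mean:

```python
def unbiased_mean(sample):
    """Self-normalized importance-sampling estimate of a population mean.

    sample: list of (value, c) pairs, where each user was drawn with
    probability proportional to c (their project count).  Weighting each
    term by 1/c cancels the size bias.
    """
    num = sum(value / c for value, c in sample)
    den = sum(1.0 / c for _, c in sample)
    return num / den
```

The function name and the `(value, c)` encoding are my assumptions; the point is only that the weights $1/c_i$ undo selection proportional to $c_i$.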

Let $c_i$ be the number of projects assigned to person $i$ and $N$ be the total number of people with at least one project. Now, if you are uniformly sampling projects, then the probability that you will pull person $i$ into your sample is:

$$p_i = \frac{c_i}{M}$$

Where $M$ is the total number of projects. Since you want a random sample over people, you want a sampling scheme such that the new probability that person $i$ is in the sample, $\hat{p}_i$, is given by:

$$\hat{p}_i = \frac{1}{N}$$

Now, you don't actually have direct control over the probability of picking a person, but you do have control over the probability of picking a project. Unfortunately, it is not generally true that you can find a distribution over projects that will achieve the above goal (think of that rare person who has worked on all the projects!).
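The bias $p_i = c_i/M$ is easy to see in a small simulation (hypothetical toy data, names are mine): a user who owns 3 of 4 projects gets drawn about three times as often under uniform project sampling.

```python
import random
from collections import Counter

# Hypothetical toy data: each project is assigned to exactly one user.
owner = {"p1": "alice", "p2": "bob", "p3": "bob", "p4": "bob"}

projects = list(owner)
# Sample projects uniformly and record which user each draw pulls in.
draws = Counter(owner[random.choice(projects)] for _ in range(100_000))
# bob owns 3 of the M = 4 projects, so he is drawn with probability ~3/4.
```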

However, we can devise an iterative sampling technique that will get us there. The strategy will be to formulate a sequence of optimization problems, where the solution to each will be in the form of a probability distribution over projects and conditional distributions over people. Each iteration involves only a single sample, where we select a project then select a person who worked on that project. Thus, we will increase our sample size by one each iteration.

Let $a_{ij}=0$ if person $i$ did not work on project $j$; otherwise, $a_{ij} \in [0,1]$ is the probability of selecting person $i$ given that project $j$ was drawn. Additionally, let $q_j$ be the probability of selecting project $j$. Under this formulation, the probability that we choose person $i$ is given by:

$$P(\mathrm{Choose\;person\;i})=\sum_{j=1}^M q_ja_{ij}$$

The $a_{ij}$ define an $N\times M$ matrix, $\mathbf{A}$, and the $q_j$ define an $M$-dimensional vector $\mathbf{q}$. What we want is to find $\mathbf{A,q}$ such that:

$$\mathbf{Aq}=\frac{1}{N}\mathbf{1}$$

With constraints:

$$\sum_{j=1}^M a_{ij} =1, \;\; i \in 1...N$$ $$\sum_{j=1}^M q_j = 1$$ $$a_{ij},q_j\geq 0$$

In general, this will be an underdetermined problem. We can attack it using quadratic programming. We will define a quadratic objective function that represents the sum of squared differences between the probability of selecting person $i$ under some feasible assignment of $A,q$ and the target probability of $1/N$.

$$\arg \min_{a_{ij},q_j}\sum_{i=1}^N \left(\langle \mathbf{a}_{i\cdot},\mathbf{q}\rangle-\frac{1}{N}\right)^2$$

$$s.t.$$

$$\sum_{j=1}^M a_{ij} =1, \;\; i \in 1...N$$

$$\sum_{j=1}^M q_j = 1$$

$$a_{ij},q_j\geq 0$$

Where I've used angle-brackets $\langle a,b \rangle$ to denote the dot product of vectors $a,b$.

Except in special circumstances, there will be multiple optima to choose from, but any of them will do.
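Here is one way the optimization could be sketched in code (my construction, not the answer's). One caveat: the objective is bilinear in $\mathbf{A}$ and $\mathbf{q}$ jointly, so a generic solver such as SciPy's SLSQP only guarantees a local optimum; any feasible point with zero objective, however, gives exactly uniform person probabilities. The function name and the flattened-variable encoding are my assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dists(mask):
    """Find A (person-given-project probs) and q (project probs) such that
    A @ q is as close to uniform (1/N) as possible.

    mask: (N, M) 0/1 array; mask[i, j] = 1 iff person i worked on project j.
    Entries of A where mask is 0 stay fixed at 0.
    """
    N, M = mask.shape
    free = np.flatnonzero(mask.ravel())  # indices of the free a_ij entries

    def unpack(x):
        a = np.zeros(N * M)
        a[free] = x[:free.size]
        return a.reshape(N, M), x[free.size:]

    def objective(x):
        A, q = unpack(x)
        return float(np.sum((A @ q - 1.0 / N) ** 2))

    # Start from row-uniform A and uniform q.
    A0 = mask / mask.sum(axis=1, keepdims=True)
    x0 = np.concatenate([A0.ravel()[free], np.full(M, 1.0 / M)])
    cons = [
        {"type": "eq", "fun": lambda x: unpack(x)[0].sum(axis=1) - 1.0},
        {"type": "eq", "fun": lambda x: unpack(x)[1].sum() - 1.0},
    ]
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * x0.size, constraints=cons)
    return unpack(res.x)
```

In the asker's special case, where every project belongs to exactly one person, each row of $\mathbf{A}$ has a single nonzero entry forced to 1, and only $\mathbf{q}$ is actually optimized.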

Using the above machinery, we can perform the following steps:

  1. Solve the above problem.
  2. Select a project according to the probability distribution given by $\mathbf{q}$. Let $J$ be the selected project.
  3. Choose a person $i$ with probability $a_{iJ}$. Let $I$ be the selected person.
  4. Add person $I$ to your sample and set $a_{Ij}=0,\;\;\forall j \in 1...M$
  5. Decrement $N$: $N\to N-1$
  6. Re-solve the above problem with the new values for $N$ and $a_{I\cdot}$
  7. Repeat steps 2 through 6 until you have your desired sample size.
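For the asker's original setting, where each project belongs to exactly one user, step 1 has a closed-form solution: the rows of $\mathbf{A}$ are forced, and $q_j = 1/(N\,c_{\mathrm{owner}(j)})$ makes every remaining user equally likely. A sketch of the loop under that assumption (names and encoding are mine):

```python
import random

def sample_users(owner, k, rng=random):
    """Draw k distinct users uniformly at random via weighted project draws.

    owner: dict mapping each project to its (single) assigned user.
    Implements steps 1-7 for the uniquely-assigned case, where the optimal
    project distribution is q_j = 1 / (N * c_owner(j)).
    """
    remaining = dict(owner)
    chosen = []
    for _ in range(k):
        counts = {}
        for user in remaining.values():      # c_i for the unsampled users
            counts[user] = counts.get(user, 0) + 1
        n = len(counts)                      # current N
        projects = list(remaining)
        weights = [1.0 / (n * counts[remaining[j]]) for j in projects]
        j = rng.choices(projects, weights=weights)[0]  # steps 1-2
        picked = remaining[j]                # step 3 (a_iJ is 0/1 here)
        chosen.append(picked)                # step 4
        remaining = {p: u for p, u in remaining.items()
                     if u != picked}         # steps 5-6
    return chosen
```

Because each round makes all remaining users equally likely, repeating to sample size $k$ gives the exchangeable sequence the answer describes.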

The reason this works is because we've designed the sampling sequence to be exchangeable. At each step, each person has the same probability (albeit increasing) of being selected, so for a sample size of $K$, every possible ordering of $K$ people from the available set of $N$ people is equally likely. So each sample has the same probability and you have a random sample.

Sooo...tedious but doable (at least in principle).