Can we collect a sub-sample of a labeled set of partially dependent data to get an independent sample?

Consider two samples of visit lengths to two different Emergency Departments. We want to test whether or not the means of the samples are different. A basic requirement is independence between and within samples. However, while we know that there are no patients that had visits in both samples, there are patients who had multiple visits within a given sample. Therefore, we cannot consider the visit lengths within a given sample to be independent.

How do we deal with this? The visit lengths are labeled with patient identifiers. If, for every set of visits associated with a patient (most of which will contain a single visit), we randomly sample one visit, can we use the resulting sub-sample for hypothesis testing, or will this process inject bias? If so, how could we control for the bias?

I've done a fair amount of searching on this topic, but there's a lot of obfuscation. Maybe I just need someone on here to tell me what terms I should be searching for.

There is 1 best solution below

In the scenario you described, if some patients had multiple visits within a sample, then the visit lengths in that sample are not independent, because visits by the same patient tend to be correlated. You can randomly select one visit per patient and use the resulting sub-sample for testing, so that each patient contributes a single observation.
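As a rough illustration of selecting one visit per patient, here is a minimal sketch using pandas and SciPy. The column names `patient_id` and `length_min`, the helper `one_visit_per_patient`, and the toy data are all assumptions made up for the example, and Welch's t-test is used here only as one plausible choice of test for comparing means.

```python
import pandas as pd
from scipy import stats


def one_visit_per_patient(visits: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Keep exactly one randomly chosen row per patient_id (hypothetical helper)."""
    # Shuffle all rows, then keep the first row encountered for each patient.
    shuffled = visits.sample(frac=1.0, random_state=seed)
    return shuffled.drop_duplicates(subset="patient_id", keep="first")


# Toy data (invented for illustration): visit lengths in minutes.
ed_a = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3, 4],
    "length_min": [120, 95, 240, 60, 75, 80, 180],
})
ed_b = pd.DataFrame({
    "patient_id": [10, 11, 11, 12, 13],
    "length_min": [200, 150, 170, 90, 310],
})

sub_a = one_visit_per_patient(ed_a, seed=1)
sub_b = one_visit_per_patient(ed_b, seed=2)

# Welch's t-test does not assume equal variances in the two departments.
result = stats.ttest_ind(sub_a["length_min"], sub_b["length_min"], equal_var=False)
print(result.statistic, result.pvalue)
```

Because the selection is random, a different seed gives a different sub-sample and a slightly different p-value, so it may be worth checking how stable the conclusion is across seeds.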

Selecting one visit per patient addresses the lack of independence within the sample, but it can introduce other biases. A patient with many visits contributes only one of them, so frequent visitors carry less weight in the sub-sample than they do in the full data. If visit length is related to how often a patient visits, or if certain types of visits end up more likely to be selected, the sub-sample may not be representative.
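To see how the sub-sample can differ from the full data, the sketch below compares two averages on invented numbers: the mean over all visits, and the mean of per-patient means, which is what the average of a one-visit-per-patient sub-sample estimates in expectation. The column names and values are assumptions chosen to exaggerate the effect.

```python
import pandas as pd

# Toy data (invented): patient 1 is a frequent visitor with long visits.
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 3],
    "length_min": [300, 300, 300, 300, 100, 100],
})

# Mean over all visits: every visit counts equally.
visit_level_mean = visits["length_min"].mean()

# Mean of per-patient means: every patient counts equally. This is the
# expected value of the mean of a one-visit-per-patient sub-sample.
patient_level_mean = visits.groupby("patient_id")["length_min"].mean().mean()

print(visit_level_mean, patient_level_mean)
```

The two numbers answer different questions (typical visit versus typical patient), so which one is "biased" depends on which quantity the test is meant to compare.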

To control for this, try stratifying the data by factors such as age group or gender. Within each stratum, randomly select one visit per patient. This approach helps keep the different strata of the population represented in the sub-sample.
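The stratified selection could be sketched as follows. The `age_group` column, the helper name `one_visit_per_patient_by_stratum`, and the data are hypothetical, and the sketch assumes each visit record already carries the stratum labels.

```python
import pandas as pd


def one_visit_per_patient_by_stratum(
    visits: pd.DataFrame, strata: list[str], seed: int = 0
) -> pd.DataFrame:
    """Within each stratum, keep one randomly chosen visit per patient (hypothetical helper)."""
    # Shuffle all rows, then keep the first row for each (stratum, patient) pair.
    shuffled = visits.sample(frac=1.0, random_state=seed)
    return shuffled.drop_duplicates(subset=[*strata, "patient_id"], keep="first")


# Toy data (invented): each patient falls in a single age group.
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 4, 5],
    "age_group": ["18-39", "18-39", "18-39", "40-64", "40-64", "65+", "65+"],
    "length_min": [120, 95, 240, 60, 75, 180, 210],
})

sub = one_visit_per_patient_by_stratum(visits, ["age_group"], seed=3)

# Check how many patients each stratum contributes to the sub-sample.
print(sub.groupby("age_group").size())
```

Comparing the per-stratum counts between the two departments is one way to check whether the sub-samples are balanced on the chosen factors before testing.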