Best practice for pairing samples for linear regression


I am building linear regression models in R between two distributions that have no ground truth and no obvious method for pairing a sample from one with a sample from the other.

What is the best practice for this scenario? The most obvious method would be to sort both sets of samples and pair them in order, but I'm wondering whether there are better methods. The other method I thought of would be to pair each sample with its nearest neighbor in the other distribution by percentile or rank.
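
To make the sorting idea concrete, here is a minimal sketch, assuming two equal-sized samples. The vectors `x` and `y` are made-up illustrative data, not anything from a real dataset, and the names `paired` and `fit` are placeholders.

```r
# Hypothetical, independently drawn samples (illustration only).
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
y <- rnorm(50, mean = 25, sd = 5)

# Pair by rank: the i-th smallest x is matched with the i-th smallest y.
paired <- data.frame(x = sort(x), y = sort(y))

# Fit the regression on the rank-paired values.
fit <- lm(y ~ x, data = paired)
coef(fit)
```

With equal sample sizes, this pairing is the same set of points that `qqplot(x, y)` draws, so the fitted line describes how the quantiles of one distribution map onto the quantiles of the other rather than a relationship between individual observations.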

What if the two distributions have different numbers of samples? Which samples should be removed? Should samples ever be duplicated?
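
One possible way to handle unequal sample sizes without removing or duplicating anything is to evaluate both samples at the same set of percentiles. The sketch below assumes this percentile-matching approach. The data, the sample sizes, and the choice of `ppoints()` for the probability grid are all illustrative assumptions.

```r
# Hypothetical samples of different sizes (illustration only).
set.seed(2)
x <- rnorm(80, mean = 10, sd = 2)   # larger sample
y <- rnorm(50, mean = 25, sd = 5)   # smaller sample

# Common probability grid, sized to the smaller sample.
n <- min(length(x), length(y))
p <- ppoints(n)

# Evaluate both samples at the same percentiles; quantile() interpolates
# between order statistics, so no sample is dropped or duplicated outright.
paired <- data.frame(
  x = quantile(x, probs = p, names = FALSE),
  y = quantile(y, probs = p, names = FALSE)
)

fit <- lm(y ~ x, data = paired)
coef(fit)
```

Sizing the grid to the smaller sample is a design choice here: it avoids asking the smaller sample for more distinct quantiles than it has observations.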

Any help would be greatly appreciated.