Formal definition of sampling


I can identify sampling when I see it, and I can write a program to sample from a distribution, but I'm wondering if there is a more rigorous, formal way to define sampling. Something more than "a process of selecting a member of a population according to some distribution." Is there a deeper definition, or is this as far as it goes?



Note: I found the Wikipedia article on sampling to be quite a good informal overview.


One thing to note is that the use of the word sample is ambiguous. For example, when I go to an ice cream shop and ask for a "sample", how well does this accord with the statistical notion of sampling?

More seriously, if I'm analyzing a convenience sample (say, respondents to an Xbox survey), how is this different from the previous example, and from the canonical "simple random sampling" we all learn in Intro Stats?

So, you have asked an interesting question about the word "sample," and I think it does go deeper. I haven't found a solid reference on the philosophy of sampling (I'd love to see such a work), but I'd argue that there is a difference between data and a sample. Specifically, data is an objective thing (any information can be data), whereas a sample is contextual. You cannot look at a bare list of numbers and say "that is a sample".

So what makes data a sample? My opinion is that a valid sample should be data collected for the purpose of answering one or more statistical questions. Not only that, but a valid sample should be relevant to the question being asked.

This is why I think that data only becomes a sample when you have context. Let's say I give you a list of temperatures. Is this a sample? It is if it were collected by, say, a meteorologist for the purpose of assessing the mean temperature in the vicinity of the sensor. However, it is not necessarily a sample for assessing the probability a politician will win an election.

I say "not necessarily" because these temperatures could be a valid sample for this question (at least in principle) -- for example, if you have a model that correlates temperature with voter turnout.

So, perhaps we can distill the essence of "sampleness" (totally made up word...but it's philosophy, so we get to take liberties ;-) by the following (tentative) definition:


Let $\Omega$ be a set and $I$ a set function defined on the subsets of $\Omega$. Further, let $X \subset \Omega$. We say that $X$ is a sample for $I$ iff there exists $Y \subset \Omega$ such that $X \neq Y$ and $I(X) \neq I(Y)$.


To me, this definition is the essence of a sample: a sample should have the capability of changing the outcome of an inference, decision, prediction, or estimate (all of these are examples of $I$).
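As a concrete toy illustration of the tentative definition (all names here are made up for the example), take $\Omega$ to be a small finite set of numbers and $I$ the arithmetic mean; then $X$ is a sample for $I$ exactly when some other subset yields a different value of $I$:

```python
from itertools import chain, combinations

# Toy illustration of the tentative definition above (hypothetical names):
# Omega is a finite set, I is a set function (here, the mean), and
# X is a "sample for I" iff some other non-empty subset Y has I(Y) != I(X).

def I(subset):
    """Set function: the arithmetic mean of a non-empty collection."""
    return sum(subset) / len(subset)

def is_sample_for(X, omega, I):
    """Return True iff some non-empty subset Y of omega, with Y != X,
    gives a different value of I -- i.e., X can change the inference."""
    elems = list(omega)
    subsets = chain.from_iterable(
        combinations(elems, r) for r in range(1, len(elems) + 1)
    )
    return any(set(Y) != set(X) and I(Y) != I(X) for Y in subsets)

omega = {1.0, 2.0, 3.0}
print(is_sample_for({1.0, 2.0}, omega, I))  # True: I({3.0}) = 3.0 != 1.5
```

Note that under a constant set function (an $I$ that ignores its input), no subset qualifies as a sample, which matches the intuition that such data cannot change any outcome.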

In the case of parametric statistics, we can define $I$ to be the likelihood function of the data set $X$. For nonparametric statistics, we could instead take $I$ to be the test statistic $T(X)$ or the estimate/prediction itself.
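A minimal sketch of the parametric case, assuming for illustration a fixed Normal model with known mean and variance: here $I(X)$ is the (log-)likelihood of $X$, and two different data sets generally produce different values of $I$, making each a sample for $I$ in the sense of the definition.

```python
import math

# Sketch of the parametric case: I(X) = log-likelihood of the data X
# under a fixed model, here N(mu=0, sigma=1) -- an assumed example model.

def log_likelihood(data, mu=0.0, sigma=1.0):
    """Log-likelihood of the data under a Normal(mu, sigma^2) model."""
    n = len(data)
    const = -0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return const - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2)

X = [0.1, -0.2, 0.3]   # data close to the assumed mean
Y = [5.0, 6.0]         # a different subset of possible observations
# I(X) != I(Y), so X can change the inference: X is a sample for I.
print(log_likelihood(X), log_likelihood(Y))
```

The same check works with any other choice of $I$; only the function being evaluated changes, not the criterion.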

Of course, in real-world examples, we've already done this filtering, so we almost always have valid samples. It would be a big mistake indeed to create an invalid sample for a problem (e.g., measuring the diameter of apples to estimate the width of the Andromeda galaxy).

However, on a more philosophical level, if we assume our world is coherent, then we cannot deny that there could exist, in principle, an informative relation between any two seemingly unrelated objects. So, again, the concept of a sample must be defined with regard to a specific function $I$ for it to be precise.