I have a very generic question about applied statistics.
Suppose, to make things simple, we have a coin with unknown probability $p$ of landing heads. We want to determine whether the coin is fair - that is, whether $p=1/2$.
We can do this by flipping the coin several times, generating a sequence such as $$0,0,1,0,0,1,1,0,1,0,1,1,1,1,0.$$ Now we must determine whether that sequence of numbers is "random".
Typically, statistical testing for randomness involves so-called "suites" or "batteries", which consist of several tests applied together. For example, random.org lists $15$ different tests on this page, which are used to confirm the randomness of its number generator.
My first question is: how on earth do they justify the use of all these tests simultaneously? Surely the $15$ test statistics are correlated with one another in ways that are hopelessly complicated? I don't see how it would be possible to make sense of such a vast array of results.
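For what it's worth, the "many tests at once" worry can be made concrete with a small simulation (my own sketch, not anything random.org describes; it assumes, unrealistically, that the battery's tests are independent). Running $15$ level-$0.05$ tests and rejecting whenever any one of them fires rejects a truly random generator roughly half the time; a correction such as Bonferroni's (which does not require independence) restores the nominal level:

```python
import random

random.seed(0)
M = 15          # number of tests in the battery
ALPHA = 0.05
RUNS = 20000

# Under the null hypothesis, every valid p-value is (at least) Uniform(0,1).
# If the M tests were independent -- an idealization; real batteries are
# correlated -- "reject if any test rejects" fires far more often than ALPHA.
naive = bonferroni = 0
for _ in range(RUNS):
    pvals = [random.random() for _ in range(M)]
    naive += min(pvals) < ALPHA
    bonferroni += min(pvals) < ALPHA / M    # Bonferroni-corrected threshold
print("any-test rejection rate:", naive / RUNS)         # ~ 1 - 0.95**15 ~ 0.54
print("Bonferroni rejection rate:", bonferroni / RUNS)  # ~ 0.05
```

Real batteries are positively correlated, which softens the inflation somewhat, but the qualitative point stands: $15$ uncorrected tests reject far more often than any single one.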
Secondly, and more importantly: let's say we are free to choose any statistical test we want (we can even make one up), after the sequence of coin flips has been generated. Is it always possible to cook up some unsavory mess of a function which returns arbitrarily low $p$-values? That is, can we fabricate a statistic such that its attaining the observed value for the given sequence of coin flips has (assuming randomness) probability smaller than any given $\epsilon>0$?
If so, what does this say about the objectivity of statistical testing? There are many different statistics which could be measured for a sequence of coin flips. Some of these, undoubtedly, will return very unlikely results. Humans are free to choose both which tests to use and what $p$-values to reject at - does this have implications for the practice of statistics? How can we measure the "randomness" of such a sequence in an objective fashion, without incorporating human bias?
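The freedom-of-choice worry can itself be illustrated numerically (my own construction, not a standard procedure). Below, three statistics of a fair sequence are computed with exact $p$-values by enumerating all $2^{15}$ equally likely outcomes; each individual test rejects at most $5\%$ of the time, but "shopping around" for whichever statistic happens to look significant rejects noticeably more often:

```python
import itertools
import random
from collections import Counter

N = 15

def longest_run(seq):
    """Length of the longest run of identical flips."""
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best

def switches(seq):
    """Number of adjacent positions where the flip changes."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

STATS = {"heads": sum, "longest_run": longest_run, "switches": switches}

# Exact null distributions, by enumerating all 2^N equally likely sequences.
ALL = list(itertools.product((0, 1), repeat=N))
DIST = {name: Counter(f(s) for s in ALL) for name, f in STATS.items()}
MEAN = {name: sum(v * c for v, c in d.items()) / 2**N
        for name, d in DIST.items()}

def pvalue(name, v):
    """Exact two-sided p-value P(|T - ET| >= |v - ET|) under fairness."""
    d, m = DIST[name], MEAN[name]
    return sum(c for u, c in d.items() if abs(u - m) >= abs(v - m)) / 2**N

random.seed(1)
RUNS = 2000
single = cherry = 0
for _ in range(RUNS):
    s = tuple(random.randint(0, 1) for _ in range(N))
    ps = [pvalue(name, f(s)) for name, f in STATS.items()]
    single += ps[0] < 0.05      # commit to the heads count in advance
    cherry += min(ps) < 0.05    # shop around for the smallest p-value
print("pre-registered test rejects:", single / RUNS)   # stays below 0.05
print("cherry-picked test rejects:", cherry / RUNS)    # noticeably larger
```

So even with perfectly valid individual tests, choosing the test after seeing the data biases the conclusion - though, as the EDIT below shows, not without limit.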
EDIT: Nobody has yet touched on the question of whether, for any given sequence of $1$'s and $0$'s a statistic can be constructed such that $P(\text{stat outcome})$ is arbitrarily small. I believe this demonstrates a negative answer:
Since there are $2^n$ possible sequences of length $n$, a statistic can take at most $2^n$ distinct values over the event space. Therefore, the least possible probability is the chance of getting that one outcome alone, which is $2^{-n}$. So the $p$-value cannot be made smaller than $2^{-n}$, and in particular cannot be pushed below an arbitrary $\epsilon>0$ - does this look correct?
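The bound can be checked by brute force on the example sequence above. The most damning statistic one can tailor to the data after the fact is the indicator of the observed sequence itself, and even it only reaches $2^{-n}$, because any statistic must count the observed sequence among the outcomes "at least as extreme" as itself. A sketch (the statistics used are arbitrary examples):

```python
import itertools
from fractions import Fraction

N = 15
OBS = (0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0)  # the sequence above

def pvalue(stat, x):
    """Exact p-value P(stat(X) >= stat(x)) under fairness, by enumeration."""
    t = stat(x)
    hits = sum(1 for s in itertools.product((0, 1), repeat=N) if stat(s) >= t)
    return Fraction(hits, 2**N)

# The statistic tailored to the data: 1 exactly on the observed sequence.
tailored = lambda x: int(x == OBS)
assert pvalue(tailored, OBS) == Fraction(1, 2**N)   # exactly 1/32768

# Any other statistic also counts OBS itself, so its p-value is >= 2^-N.
assert pvalue(sum, OBS) >= Fraction(1, 2**N)
print(pvalue(sum, OBS))   # 1/2: eight heads out of fifteen is unremarkable
```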
A $p$-value is actually defined (in Statistical Inference by Casella and Berger, for instance) to be a statistic, not just a number, and it is called valid if it satisfies $$ P_\theta[p({\bf X}) \leq \alpha] \leq \alpha $$ for every $\theta \in \Theta_0$. So the probability of seeing small values is genuinely low under the null hypothesis. This makes sense. If you constrain yourself to valid $p$-values, then you can't simply choose any $p$-value you like.
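As a concrete check of this definition (my own sketch, using the exact two-sided binomial test for the coin example, not anything from Casella and Berger): the validity inequality can be verified exactly over a grid of levels $\alpha$, since the null distribution of the heads count is known in closed form.

```python
from fractions import Fraction
from math import comb

N = 15

def pval(h):
    """Exact two-sided p-value for h heads under H0: p = 1/2."""
    dev = abs(2 * h - N)   # doubled deviation |h - N/2|, kept integral
    tail = sum(comb(N, k) for k in range(N + 1) if abs(2 * k - N) >= dev)
    return Fraction(tail, 2**N)

pvals = {h: pval(h) for h in range(N + 1)}
null = {h: Fraction(comb(N, h), 2**N) for h in range(N + 1)}

# Validity: P_{H0}( pval(H) <= alpha ) <= alpha, checked exactly on a grid.
for alpha in (Fraction(a, 1000) for a in range(1, 1000)):
    attained = sum(pr for h, pr in null.items() if pvals[h] <= alpha)
    assert attained <= alpha
print("valid: P(p(X) <= alpha) <= alpha for every alpha checked")
```

Note that the smallest attainable $p$-value here is $2/2^{15}$ (both tails), consistent with the $2^{-n}$ floor discussed in the question's EDIT.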
Also keep in mind that there is the notion of the information contained in a sample, and the degree to which a statistic preserves it (sufficiency).
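For instance (a standard fact, sketched here in code for a small case): for iid coin flips the heads count is sufficient for $p$ - conditional on it, the sequence is uniform over arrangements, whatever $p$ is, so nothing about $p$ is lost by reducing the data to the count.

```python
from fractions import Fraction
from itertools import product
from math import comb

N = 6  # small enough to enumerate every sequence exactly

def prob(seq, p):
    """P(X = seq) for N iid Bernoulli(p) flips."""
    h = sum(seq)
    return p**h * (1 - p)**(N - h)

# Conditional on the heads count, the sequence is uniform over arrangements,
# for every p -- i.e. the heads count preserves all information about p.
for p in (Fraction(1, 3), Fraction(1, 2), Fraction(3, 4)):
    for h in range(N + 1):
        seqs = [s for s in product((0, 1), repeat=N) if sum(s) == h]
        total = sum(prob(s, p) for s in seqs)        # P(heads count = h)
        for s in seqs:
            assert prob(s, p) / total == Fraction(1, comb(N, h))
print("heads count is sufficient for p")
```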