Is inequality between population samples guaranteed?


This is a genuine mathematical inquiry. Depending on how it is assessed, it has many implications, from homelessness to car accidents and so forth. Some of these would be bad for society. On the other hand, the result would be scientifically important and useful to society in other ways.

The proposal is

Between any two population samples, unless they are the same sample, the quotient of the distributions becomes very large or very small at the tail ends.

I am specifically requesting an analysis of this claim that either strengthens or debunks it. Successfully attacking the reasoning that leads to this proposal is equally valid. A good reference or references would also be acceptable.

To prove this in the ideal case, only the normal distribution is used. This is why this inquiry is within scientific reason.

Suppose two samples have respective statistical distributions $A(x) = \sqrt{(\alpha / \pi)} e^{-\alpha(x+\beta)^2}$ and $B(x) = \sqrt{(\alpha'/\pi)}e^{-\alpha'(x+\beta')^2}$.

If $\alpha' = \alpha + \Delta \alpha$ and $\beta' = \beta + \Delta \beta$ then

$$ \frac{A(x)}{B(x)} = \sqrt{\frac{\alpha}{\alpha + \Delta \alpha}}\exp{\Big(x^2\Delta\alpha + 2(\alpha\Delta\beta +\beta \Delta \alpha + \Delta \alpha \Delta \beta)x + \alpha'\beta'^2-\alpha \beta^2\Big)} $$
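As a check on the exponent, here is the expansion written out term by term. It uses only the definitions above and introduces no new symbols.

$$ \alpha'(x+\beta')^2 - \alpha(x+\beta)^2 = (\alpha'-\alpha)x^2 + 2(\alpha'\beta' - \alpha\beta)x + \alpha'\beta'^2 - \alpha\beta^2 $$

$$ \alpha'\beta' - \alpha\beta = (\alpha+\Delta\alpha)(\beta+\Delta\beta) - \alpha\beta = \alpha\Delta\beta + \beta\Delta\alpha + \Delta\alpha\Delta\beta $$

With $\alpha' - \alpha = \Delta\alpha$, the coefficient of $x^2$ is $\Delta\alpha$ and the coefficient of $x$ is $2(\alpha\Delta\beta + \beta\Delta\alpha + \Delta\alpha\Delta\beta)$.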

For small $\Delta \alpha$ and $\Delta \beta$ this is nearly 1 around the origin, but it is the exponential of a quadratic in $x$, so it increases or decreases extremely rapidly to the left and right. If $\Delta \alpha$ and $\Delta \beta$ are sufficiently small, this makes our proposal moot, as the ratio is nearly 1 for all values that matter. But I suspect this level of equality is not generally the case in real life and might not even be humanly possible to create. For larger values this inequality would be amplified.
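To get a feel for the size of the effect, here is a minimal R sketch that evaluates $A(x)/B(x)$ at a few points. The function name `density_ratio` and the parameter values are assumptions chosen purely for illustration ($\alpha = 0.5$ corresponds to unit variance); they do not come from any data.

```r
# Ratio A(x)/B(x) of the two Gaussian densities defined in the question.
# All parameter values below are illustrative assumptions, not real data.
density_ratio <- function(x, alpha, beta, d_alpha, d_beta) {
  alpha2 <- alpha + d_alpha   # alpha' = alpha + Delta alpha
  beta2  <- beta + d_beta     # beta'  = beta  + Delta beta
  sqrt(alpha / alpha2) *
    exp(alpha2 * (x + beta2)^2 - alpha * (x + beta)^2)
}

x <- c(-6, -3, 0, 3, 6)
ratios <- density_ratio(x, alpha = 0.5, beta = 0, d_alpha = 0.02, d_beta = 0.05)
print(round(ratios, 3))
```

With these small differences the ratio stays close to 1 near the origin and moves away from 1 as $|x|$ grows, which is the behaviour described above.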

An important point is that this demonstrates why the tail ends of a distribution are unreliable and unrepresentative as a sample of the entire distribution. However, everyone in their own area is in the tail end of their own distribution. This makes getting samples harder. Researchers often sample students, but in light of this that practice may be unreasonable and inaccurate.


There are 2 best solutions below


I disagree with much of what you have to say, partly on grounds of personal opinion and partly because you are stating mathematical propositions without proof.

There is one possibly related fact that may be of interest. Suppose 1/3 of the population of a city belongs to group A and 2/3 belongs to group B, and we choose 75 people at random.

The probability that exactly 25 of them will be from group A is determined by the binomial distribution. If $X$ is the number of A's chosen out of $75,$ then $X \sim \mathsf{Binom}(75, 1/3)$ and $$P(X = 25) = {75 \choose 25}\left(\frac 1 3\right)^{25} \left(\frac 2 3\right)^{50} = 0.0973.$$

dbinom(25, 75, 1/3)
[1] 0.09734124

So if you're looking for the sample proportion to be exactly the same as the population proportion, you will seldom see that. [It turns out that $\{X=25\}$ is the most likely result, but there are too many competing results for the most likely result to be hugely likely.]
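The bracketed remark can be checked directly. The following is a small sketch in R; the variable names `probs`, `mode_count` and `mode_prob` are just illustrative.

```r
# Probabilities of every possible count 0..75 under Binom(75, 1/3)
probs <- dbinom(0:75, size = 75, prob = 1/3)

# which.max() returns a 1-based index, while the counts start at 0
mode_count <- which.max(probs) - 1
mode_prob  <- max(probs)

print(mode_count)
print(mode_prob)
```

Here `mode_count` is the most likely number of A's and `mode_prob` is its probability, which equals the value of `dbinom(25, 75, 1/3)` shown above.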

However, the probability of being within five of $25$ is $P(20 \le X \le 30) = 0.8227.$

sum(dbinom(20:30, 75, 1/3))
[1] 0.8227381
diff(pbinom(c(19,30), 75, 1/3))
[1] 0.8227381


Addendum: Here is another case of what might be called 'systematic unfairness'. Four equally talented people play 12 games of Monopoly. You might think that would usually result in about three wins for each person. But in such 12-game 'tournaments' the mean number of games won by the most successful of the four is about 4.9 games. (Least successful about 1.3.) Here is a simulation of 100,000 12-game tournaments.

# each column of rmultinom() is one game won by one of 4 equally likely players
w = replicate( 10^5, 
    max(rowSums(rmultinom(12, 1, rep(1/4, 4)))) )
mean(w)
## about 4.86

Leaving aside the implications of intent, here's a model inspired by chess. Let's say that the "player strength" in some competition is modeled by a random variable $X$ (in chess, it could be the Elo rating), with distribution $f_X(x)$ (typically a Gaussian distribution, or perhaps a logistic one).

Given two random players $A, B$, let's assume that the probability of $A$ winning (let's discard draws) is a function of the strength difference, $p = g(X_A-X_B)$, where $g$ is odd around $(0,\frac12)$, i.e. $g(-d) = 1 - g(d)$. Again, $g$ could typically be a Gaussian or logistic cumulative distribution function.

Now, assume we have two different populations with similar but not identical average strength, e.g. $f_{X_A}(x)=f_{X_B}(x-\epsilon)$ with small $\epsilon > 0$ (population $A$ is slightly stronger).

The OP is interested (I think) in showing that, under these conditions and for some typical/reasonable distributions, the advantage for population $A$ (its expected winning rate) is small if we pick a random player from each population, assuming $\epsilon$ is small, but that the advantage increases dramatically if we pick the random players from the upper percentiles.
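The random-player part of this model can be sketched numerically in R. The choices below are assumptions made only for illustration: strengths are independent Gaussians with standard deviation 1, $g$ is the standard logistic CDF (`plogis`), and `expected_win_rate` is a hypothetical helper name. The upper-percentile case is not covered, since how those players would be selected is not specified here.

```r
# Expected win rate of a random player from population A against a random
# player from population B, under illustrative assumptions:
#   strengths are independent Gaussians with sd 1, A's mean is shifted by eps,
#   and g is the standard logistic CDF.
expected_win_rate <- function(eps) {
  # X_A - X_B is Gaussian with mean eps and variance 1 + 1 = 2
  integrand <- function(d) plogis(d) * dnorm(d, mean = eps, sd = sqrt(2))
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

eps_values <- c(0, 0.05, 0.1, 0.2)
rates <- sapply(eps_values, expected_win_rate)
print(round(rates, 4))
```

Under these assumptions the win rate equals one half at $\epsilon = 0$ by symmetry and rises only slightly for small $\epsilon$, consistent with the small advantage described for random players.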