Do Kolmogorov's axioms permit speaking of frequencies of occurrence in any meaningful sense?


It is frequently stated (in textbooks, on Wikipedia) that the "Law of large numbers" in mathematical probability theory is a statement about relative frequencies of occurrence of an event in a finite number of trials, or that it "relates the axiomatic concept of probability to the statistical concept of frequency". Isn't this a methodological mistake of ascribing to a mathematical term an interpretation that does not at all follow from how the term is mathematically defined, perhaps by relying too much on its colorful language? Recall the typical derivation of the WLLN:

Let $X_1, X_2, \ldots, X_n$ be a sequence of $n$ independent and identically distributed random variables with finite mean $\mu$ and finite variance $\sigma^2$, and let:

$\overline{X}=\tfrac1n(X_1+\cdots+X_n)$

We have:

$E[\overline{X}] = \frac{E[X_1+\cdots+X_n]}{n} = \frac{E[X_1]+\cdots+E[X_n]}{n} = \frac{n\mu}{n} = \mu$

$\operatorname{Var}[\overline{X}] = \frac{\operatorname{Var}[X_1+\cdots+X_n]}{n^2} = \frac{\operatorname{Var}[X_1]+\cdots+\operatorname{Var}[X_n]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$

And from Chebyshev's inequality:

$P(|\overline{X}-\mu|>\epsilon) \le \frac{\sigma^2}{n\epsilon^2}$

And so $\overline{X}$ is said to converge in probability to $\mu$.
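(To keep the symbols concrete: under the usual frequentist reading, which the rest of this question challenges, the bound can be checked numerically by simulating fair-coin sample means and comparing the empirical deviation frequency to $\sigma^2/(n\epsilon^2)$. The sample sizes and $\epsilon$ below are arbitrary choices.)

```python
import random

random.seed(0)

n = 1_000        # tosses per sample mean (arbitrary choice)
reps = 2_000     # number of independent sample means
eps = 0.05
mu, var = 0.5, 0.25  # fair-coin Bernoulli: mean 1/2, variance 1/4

# Fraction of replications where |Xbar - mu| > eps
deviations = 0
for _ in range(reps):
    xbar = sum(random.random() < 0.5 for _ in range(n)) / n
    deviations += abs(xbar - mu) > eps
freq = deviations / reps

bound = var / (n * eps ** 2)  # Chebyshev: 0.25 / (1000 * 0.0025) = 0.1
print(freq, "<=", bound)
```

The empirical deviation frequency comes out far below the Chebyshev bound, as expected: Chebyshev is loose.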

Now consider what is strictly speaking the meaning of this expression in the axiomatic framework it is derived in:

$P(|\overline{X}-\mu|>\epsilon) \le \frac{\sigma^2}{n\epsilon^2}$

$P()$, everywhere it occurs in the derivation, is known only to be a number satisfying Kolmogorov's axioms: a number between 0 and 1, and so forth. None of the axioms introduces any theoretical equivalent of the intuitive notion of frequency. If no additional assumptions about $P()$ are made, the sentence obviously cannot be interpreted at all; just as importantly, the theoretical mean $\mu$ is not necessarily the mean value in an infinite number of trials, $\overline{X}$ is not necessarily the mean value from $n$ trials, and so forth. Consider an experiment of tossing a fair coin repeatedly: quite obviously, nothing in Kolmogorov's axioms enforces using 1/2 for the probability of heads. You could just as well use $1/\sqrt{\pi}$, and the derivation would continue to "work", except that the meaning of the various variables would no longer agree with their intuitive interpretations. The $P()$ might still mean something; it might be a quantification of an absurd belief of mine, and the mathematical derivation would remain true regardless, in the sense that as long as the initial $P()$'s satisfy the axioms, theorems about other $P()$'s follow. With Kolmogorov's axioms providing only weak constraints on, and not a definition of, $P()$, it is basically only symbol manipulation.

This "relative frequency" interpretation frequently given seems to rest on an additional assumption, and this assumption seems to be a form of the law of large numbers itself. Consider this fragment from Kolmogorov's Grundbegriffe on applying the results of probability theory to the real world:

We apply the theory of probability to the actual world of experiment in the following manner:

...

4) Under certain conditions, which we shall not discuss here, we may assume that the event A which may or may not occur under conditions S, is assigned a real number P(A) which has the following characteristics:

a) One can be practically certain that if the complex of conditions S is repeated a large number of times, n, then if m be the number of occurrences of event A, the ratio m/n will differ very slightly from P(A).

This seems equivalent to introducing the weak law of large numbers, in a particular and slightly different form, as an additional axiom.

Meanwhile, many reputable sources contain statements that seem completely in opposition to the above reasoning, for example Wikipedia:

It follows from the law of large numbers that the empirical probability of success in a series of Bernoulli trials will converge to the theoretical probability. For a Bernoulli random variable, the expected value is the theoretical probability of success, and the average of n such variables (assuming they are independent and identically distributed (i.i.d.)) is precisely the relative frequency.

This seems mistaken already in claiming that anything about empirical probability can follow from a mathematical theorem (the page on empirical probability defines it as the relative frequency in actual experiment), but there are many more subtle claims that technically also seem erroneous in light of the above considerations:

The LLN is important because it "guarantees" stable long-term results for the averages of random events.

Note that the Wikipedia article about the LLN claims to be about the mathematical theorem, not about the empirical observation, which historically was also sometimes called the LLN. It seems to me that the LLN does nothing to "guarantee stable long-term results": as stated above, those stable long-term results have to be assumed in the first place for the terms occurring in the derivation to have the intuitive meaning we typically ascribe to them, not to mention that something has to be done to interpret $P()$ at all in the first place. Another instance from Wikipedia:

According to the law of large numbers, if a large number of six-sided die are rolled, the average of their values (sometimes called the sample mean) is likely to be close to 3.5, with the precision increasing as more dice are rolled.

Does this really follow from the mathematical theorem? In my opinion, the interpretation of the theorem that is used here rests on assuming this fact. There is a particularly vivid example in the "Treatise on Probability" by Keynes of what happens when one follows the WLLN with even a slight deviation from the initial assumption of the $p$'s being the relative frequencies in the limit of an infinite number of trials:

The following example from Czuber will be sufficient for the purpose of illustration. Czuber’s argument is as follows: In the period 1866–1877 there were registered in Austria

m = 4,311,076 male births

n = 4,052,193 female births

s = 8,363,269

for the succeeding period, 1877–1899, we are given only

m' = 6,533,961 male births;

what conclusion can we draw as to the number n of female births? We can conclude, according to Czuber, that the most probable value

n' = nm'/m = 6,141,587

and that there is a probability P = .9999779 that n will lie between the limits 6,118,361 and 6,164,813. It seems in plain opposition to good sense that on such evidence we should be able with practical certainty P = .9999779 = 1 − 1/45250 to estimate the number of female births within such narrow limits. And we see that the conditions laid down in § 11 have been flagrantly neglected. The number of cases, over which the prediction based on Bernoulli’s Theorem is to extend, actually exceeds the number of cases upon which the à priori probability has been based. It may be added that for the period, 1877–1894, the actual value of n did lie between the estimated limits, but that for the period, 1895–1905, it lay outside limits to which the same method had attributed practical certainty.
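(For what it's worth, Czuber's point estimate can be re-derived from the quoted counts; the last digits may differ slightly depending on rounding conventions.)

```python
# Czuber's proportional estimate n' = n * m' / m, using the quoted counts.
m, n = 4_311_076, 4_052_193   # male and female births, 1866-1877
m_prime = 6_533_961           # male births in the succeeding period

n_prime = round(n * m_prime / m)
print(n_prime)  # close to the quoted 6,141,587 (exact rounding differs slightly)
```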

Am I mistaken in my reasoning above, or are all of those really mistakes in Wikipedia? I have seen similar statements all over the place in textbooks, and I am honestly wondering what I am missing.


There are 5 best solutions below

Answer (score 42)

Kolmogorov's axioms, if one were to make an assumption about the distribution of the random variables $X_i$, could be used to derive the distribution of the random variable $\bar{X}$. Notice that since each $X_i$ is a random variable, $\bar{X}$ is also a random variable, and therefore has a probability measure of its own. The beauty of the WLLN is that, so long as both $\mu$ and $\sigma^2$ are finite, no further assumptions about the measure $P()$ must be made in order to derive that $\bar{X}$ converges in probability to $\mu$. I agree with Hurkyl. Perhaps this post will help with the concept of a random variable: https://stats.stackexchange.com/questions/50/what-is-meant-by-a-random-variable

You do make a good point, however, about whether the assumption that the $X$'s are independent and identically distributed random variables holds in practice, which is the problem alluded to in the Keynes example.

The example regarding dice appears to rely on the assumption that the die is fair, which may or may not be reasonable depending on how the die is constructed and rolled. However, it seems reasonable to assume that there exist appropriate setups of dice-rolling experiments for which the rolls are i.i.d. random variables with a probability measure $P$. In such a case, it does follow from the WLLN that $\bar{X}$ converges in probability to $\mu$.
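Under that i.i.d. fairness assumption, the convergence is easy to watch numerically (the roll counts below are arbitrary choices):

```python
import random

random.seed(1)

# Sample means of fair-die rolls for increasing numbers of rolls.
for n in (100, 10_000, 1_000_000):
    mean = sum(random.randint(1, 6) for _ in range(n)) / n
    print(n, mean)  # the sample means drift toward mu = 3.5 as n grows
```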

Answer (score 0)

What you're missing is that the derivation of the WLLN is allowed to use, not only the Kolmogorov axioms, but also the assumption stated in the theorem: "The $X_1,X_2,\dots,X_n$ are a sequence of $n$ independent and identically distributed random variables with the same finite mean μ, and with variance $σ^2$". So, for example, if we are tossing a fair coin, we know that μ=1/2 (this is what "fair coin" means in probability theory), not $1/\sqrt\pi$. And likewise, in a Bernoulli trial, we are given the actual mean to which the observed probabilities are supposed to converge. And Keynes/Czuber's example isn't a valid application of the LLN because we are not given the actual mean and standard deviation.

So the first two claims in the Wikipedia article are basically correct (except that "will converge to the theoretical probability" should read "will converge in probability to the theoretical probability"; the probability that the observed values do not converge to the theoretical value is 0; but it might happen anyway).

However, the third claim, "According to the law of large numbers, if a large number of six-sided die are rolled, the average of their values (sometimes called the sample mean) is likely to be close to 3.5, with the precision increasing as more dice are rolled." doesn't follow, since we don't know a priori that rolling a six-sided die constitutes a Bernoulli trial. Looking at the context, it seems that the fairness of the die is meant as an ambient assumption, since one of the preceding sentences is "For example, a single roll of a six-sided die produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability."

Answer (score 0)

You are correct. The Law of Large Numbers does not actually say as much as we would like to believe. Confusion arises because we try to ascribe too much philosophical importance to it. There is a reason the Wikipedia article puts quotes around 'guarantees': nobody actually believes that some formal theory (on its own) guarantees anything about the real world. All the LLN says is that some notion of probability, without interpretation, approaches 1, nothing more and nothing less. It certainly doesn't prove for a fact that relative frequency approaches some probability (what probability?). The key to understanding this is to note that the LLN, as you pointed out, actually uses the term $P()$ in its own statement. I will use this version of the LLN:

"The probability of a particular sampling's frequency distribution resembling the actual probability distribution (to a degree) as it gets large approaches 1."

Interpreting "probability" in the frequentist sense, it becomes this:

Interpret "actual probability distribution": "Suppose that as we take larger samples, they converge to a particular relative frequency distribution..."

Interpret the statement: "... Now if we were given enough instances of n-numbered samplings, the ratio of those that closely resemble (within $\epsilon$) the original frequency distribution vs. those that don't approaches 1 to 0. That is, the relative frequency of the 'correct' instances converges to 1 as you raise both n and the number of instances."

You can imagine it like a table. Suppose for example that our coin has T-H with 50-50 relative frequency. Each row is a sequence of coin tosses (a sampling), and there are several rows -- you're kind of doing several samples in parallel. Now add more columns, i.e. add more tosses to each sequence, and add more rows, increasing the number of sequences themselves. As we do so, count the number of rows which have a near 50-50 frequency distribution (within some $\epsilon$), and divide by the total number of rows. This number should certainly approach 1, according to the theorem.
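A minimal sketch of that table (the row counts and $\epsilon$ are arbitrary choices): as both the tosses per row and the number of rows grow, the fraction of rows within $\epsilon$ of a 50-50 split climbs toward 1.

```python
import random

random.seed(2)
eps = 0.05

def fraction_near_half(rows: int, tosses: int) -> float:
    """Fraction of rows whose heads frequency is within eps of 1/2."""
    hits = 0
    for _ in range(rows):
        heads = sum(random.random() < 0.5 for _ in range(tosses))
        hits += abs(heads / tosses - 0.5) <= eps
    return hits / rows

# Grow the table in both directions: more rows AND more tosses per row.
fracs = [fraction_near_half(rows, tosses)
         for rows, tosses in [(200, 10), (500, 100), (2000, 1000)]]
print(fracs)  # increasing toward 1
```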

Now some might find this fact very surprising or insightful, and that's pretty much what's causing the whole confusion in the first place. It shouldn't be surprising, because if you look closely at our frequentist interpretation example, we assumed "Suppose that our coin has T-H with 50-50 relative frequency." In other words, we have already assumed that any particular sequence of tosses will, with logical certainty, approach a 50-50 frequency split. So it should not be surprising when we say, with logical certainty, that a progressively larger proportion of these tossing sequences will resemble 50-50 splits if we toss more in each and recruit more tossers. It's almost a rephrasing of the original assumption, but at a meta-level (we're talking about samples of samples).

So this certainty about the real world (interpreted LLN) only comes from another, assumed certainty about the real world (interpretation of probability).

First of all, with a frequentist interpretation, it is not the LLN that states that a sample will approach the relative frequency distribution -- it's the frequentist interpretation/definition of $P()$ that says this. It sure is easy to think that, though, if we interpret the whole thing inconsistently -- i.e. if we lazily interpret the outer "probability that ... approaches 1" to mean "... approaches certainty" in LLN but leave the inner statement "relative frequency dist. resembles probability dist." up to (different) interpretation. Then of course you get "relative frequency dist. resembles probability dist. in the limit". It's kind of like if you have a limit of an integral of an integral, but you delete the outer integral and apply the limit to the inner integral.

Interestingly, if you interpret probability as a measure of belief, you might get something that sounds less trivial than the frequentist's version: "The degree of belief in 'any sample reflects actual belief measures in its relative frequencies within $\epsilon$ error' approaches certainty as we choose bigger samples." However, this is still different from "Samples, as they get larger, approach actual belief measures in their relative frequencies." As an illustration, imagine that you have two sequences $f_n$ and $p_n$. I am sure you can appreciate the difference between $\lim_{n \to \infty} P(|f_n - p_n| < \epsilon) = 1$ and $\lim_{n \to \infty} |f_n - p_n| = 0$. The latter implies $\lim_{n \to \infty} f_n = \lim_{n \to \infty} p_n$ (or $= p$, taking $p_n$ to be a constant for simplicity), whereas this is not true for the former. The latter is a very powerful statement, and probability theory cannot prove it, as you suspected.

In fact, you were on the right track with the "absurd belief" argument. Suppose that probability theory were indeed capable of proving this amazing theorem, that "a sample's relative frequency approaches the probability distribution". However, as you've found, there are several interpretations for probability which conflict with each other. To borrow terminology from mathematical logic: you've essentially found two models of probability theory; one satisfies the statement "the rel. frequency distribution approaches $1/2 : 1/2$", and another satisfies the statement "the rel. frequency distribution approaches $1/\pi : (1-1/\pi)$". So the statement "frequency approaches probability" is neither true nor false: it is independent as either one is consistent with the theory. Thus, Kolmogorov's probability theory is not powerful enough to prove a statement in the form "frequency approaches probability". (Now, if you were to force the issue by saying "probability should equal relative frequency" you've essentially trivialized the issue by baking frequentism into the theory. The only possible model for this probability theory would be frequentism or something isomorphic to it, and the statement becomes obvious.)

Answer (score 1)

I. I agree with you that no version of the Law of Large Numbers tells us something about real life frequencies, already for the reason that no purely mathematical statement tells us anything about real life at all, without first giving the mathematical objects in it a "real life interpretation" (which never can be stated, let alone "proven", within mathematics itself).

Rather, I think of the LLN as something which, within any useful mathematical model of probabilities and statistical experiments, should hold true! In the sense that: If you show me a new set of axioms for probability theory, which you claim have some use as a model for real life dice rolling etc.; and those axioms do not imply some version of the Law of Large Numbers -- then I would dismiss your axiom system, and I think so should you.


II. Most people would agree there is a real life experiment which we can call "tossing a fair coin" (or "rolling a fair die", "spinning a fair roulette wheel" ...), where we have a clearly defined finite set of outcomes, none of the outcomes is more likely than any other, we can repeat the experiment as many times as we want, and the outcome of the next experiment has nothing to do with any outcome we have so far.

And we could be interested in questions like: Should I play this game where I win/lose this much money in case ... happens? Is it more likely that after a hundred rolls, the added number on the dice is between 370 and 380, or between 345 and 350? Etc.

To gather quantitative insight into answering these questions, we need to model the real life experiment with a mathematical theory. One can debate (but again, such a debate happens outside of mathematics) what such a model could tell us, whether it could tell us something with certainty, whatever that might mean; but most people would agree that it seems we can get some insight here by doing some kind of math.

Indeed, we are looking for two things which only together have any chance to be of use for real life: namely, a "purely" mathematical theory, together with a real life interpretation (like a translation table) thereof, which allows us to perform the routine we (should) always do:

Step 1: Translate our real life question into a question in the mathematical model.

Step 2: Use our math skills to answer the question within the model.

Step 3: Translate that answer back into the real life interpretation.

The axioms of probability, as for example Kolmogorov's, do that: They provide us with a mathematical model which will give out very concrete answers. As with every mathematical model, those concrete answers -- say, $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$ -- are absolutely true within the mathematical theory (foundational issues a la Gödel aside for now). They also come with a standard interpretation (or maybe, a standard set of interpretations, one for each philosophical school). None of these interpretations are justifiable by mathematics itself; and what any result of the theory (like $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$) tells us about our real life experiment is not a mathematical question. It is philosophical, and very much up to debate. Maybe a frequentist would say, this means that if you roll 100 dice again and again (i.e. performing kind of a meta-experiment, where each individual experiment is already 100 "atomic experiments" averaged), then the relative frequency of ... is greater than the relative frequency of ... . Maybe a Bayesian would say, well it means that if you have some money to spare, and somebody gives you the alternative to bet on this or that outcome, you should bet on this, and not that. Etc.
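(As an aside, and not part of the argument above: the model-internal comparison $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$ is itself easy to estimate by simulation under the fair i.i.d. die model; the Monte Carlo size here is an arbitrary choice.)

```python
import random

random.seed(3)
trials = 20_000  # arbitrary Monte Carlo size

# Estimate P(Xbar_100 in [3.45, 3.5]) and P(Xbar_100 in [3.7, 3.8])
# for the mean of 100 fair die rolls.
in_a = in_b = 0
for _ in range(trials):
    xbar = sum(random.randint(1, 6) for _ in range(100)) / 100
    in_a += 3.45 <= xbar <= 3.5
    in_b += 3.7 <= xbar <= 3.8
print(in_a / trials, ">", in_b / trials)
```

The interval nearer the mean 3.5 wins, despite being narrower, which is the concrete answer the model gives out in Step 2.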


III. Now consider the following statement, which I claim would be accepted by almost everyone:

( $\ast$ ) "If you repeat a real life experiment of the above kind many times, then the sample means should converge to (become a better and better approximation of) the ideal mean".

A frequentist might smirkingly accept ($\ast$), but quip that it's true by definition, because he might claim that any definition of such an "ideal mean" beyond "what the sample means converge to" is meaningless. A Bayesian might explain the "ideal mean" as, well you know, the average -- like if you put it in a histogram, see, here is the centre of weight -- the outcome you would bet on -- you know! And she might be content with that. And she would say, yes, of course that is related to relative frequencies exactly in the sense of ($\ast$).

I want to stress that ($\ast$) is not a mathematical statement. It is a statement about real life experiments, which we claim to be true, although we might not agree on why we do so: depending on your philosophical background, you can see it as a tautology or not, but even if you do, it is not a mathematical tautology (it's not a mathematical statement at all), just maybe a philosophical one.

And now let's say we do want a model-plus-translation-table for our experiments from paragraph II. Such a model should contain an object which models [i.e. whose "real life translation" is] one "atomic" experiment: that is the random variable $X$, or to be precise, an infinite collection of i.i.d. random variables $X_1, X_2, ...$.

It contains something which models "the actual sample mean after $100,1000, ..., n$ trials": that is $\bar X_n := \frac{1}{n}\sum_1^n X_i$.

And it contains something which models "an ideal mean": that is $\mu=EX$.

So with that model-plus-translation, we can now formulate, within such model, a statement (or set of related statements) which, under the standard translation, appear to say something akin to ($\ast$).

And that is the (or are the various forms of the) Law of Large Numbers. And they are true within the model, and they can be derived from the axioms of that model.

So I would say: The fact that they hold true e.g. in Kolmogorov's Axioms means that these axioms pass one of the most basic tests they should pass: We have a philosophical statement about the real world, ($\ast$), which we believe to be true, and of the various ways we can translate it into the mathematical model, those translations are true in the model. The LLN is not a surprising statement on a meta-mathematical level for the following reason: Any kind of model for probability which, when used as model for the above real life experiment, would not give out a result which is the mathematical analogy of statement ($\ast$), should be thrown out!

In other words: Of course good probability axioms give out the Law of Large Numbers. They are made so that they give them out. If somebody proposed a set of mathematical axioms, and a real-life-translation-guideline for the objects in there, and any model-internal version of ($\ast$) would be wrong -- then that model should be deemed useless (both by frequentists and Bayesians, just for different reasons) to model the above real life experiments.


IV. I want to finish by pointing out one instance where your argument seems contradictory, which, when exposed, might make what I write above more plausible to you.

Let me simplify an argument of yours like this:

(A) A mathematical statement like the LLN in itself can never make any statement about real life frequencies.

(B) Many sources claim that LLN does make statements about real life frequencies. So they must be implicitly assuming more.

(C) As an example, you exhibit a Kolmogorov quote about applying probability theory to the real world, and say that it "seems equivalent to introducing the weak law of large numbers in a particular, slightly different form, as an additional axiom."

I agree with (A) and (B). But (C) is where I want you to pause and think: Were we not in agreement, cf. (A), that no mathematical statement can ever tell us something about real life frequencies? Then what kind of "additional axiom" would say that? Whatever the otherwise mistaken sources in (B) are implicitly assuming, and Kolmogorov himself talks about in (C), it cannot just be an "additional axiom", at least not a mathematical one: Because one can throw in as many mathematical axioms as one wants, they will never bridge the fundamental gap in (A).

I claim the thing that all the sources in (B) are implicitly assuming, and what Kolmogorov talks about in (C), is not an additional axiom within the mathematical theory. It is the meta-mathematical translation / interpretation that I talk about above, which in itself is not mathematical, and in particular cannot be introduced as an additional axiom within the theory.

I claim, indeed, most sources are very careless, in that they totally forget the translation / interpretation part between real life and mathematical model, i.e. the bridge we need to cross the gap in (A); i.e. steps 1 and 3 in the routine explained in paragraph II. Of course it is taught in any beginner's class that any model in itself (i.e. without a translation, without steps 1 and 3) is useless, but it is commonly forgotten already in the non-statistical sciences, and more so in statistics, which leads to all kind of confusions. We spend so much time and effort on step 2 that we often forget steps 1 and 3; also, step 2 can be taught and learned and put on exams, but steps 1 and 3 not so well: they go beyond mathematics, seem to fit better into a science or philosophy class (although I doubt they get a good enough treatment there either). However, if we forget them, we are left with a bunch of axioms linking together almost meaningless symbols; and the remnants of meaning which we, as humans, cannot help applying to these symbols, quickly seem to be nothing but circular arguments.

Answer (score 0)

I understand OP's concern and I want to illustrate it by an example from geometry.

Pythagorean theorem. Let $V$ be a two-dimensional real vector space with inner product $\langle \cdot, \cdot \rangle$ and induced norm $\| \cdot \|$. In linear algebra we learn that $\langle a,b \rangle = 0$ implies $\| a \|^2 + \| b \|^2 = \| a + b \|^2$, and this result is called the Pythagorean theorem. This is Step 2 of the routine in Torsten's answer [1]. Clearly, we would like to know what the connection of this Pythagorean theorem is with right-angled triangles drawn on a real sheet of paper. So we need to think about Steps 1 and 3 of the routine. Generations of students have drawn right-angled triangles and measured the lengths of the legs $a,b$ and the hypotenuse $c$, arriving at the identity $a^2+b^2=c^2$ to some degree of precision. Based on the amount of data, it is plausible to assume that there exists an empirical Pythagorean theorem in the real world. Now we can use the empirical Pythagorean theorem and the standard interpretation (using a rectangular coordinate system) to 'identify' $\| a \|$ with the length of $a$. In this way we obtain an interpretation of the Pythagorean theorem (in $V$) in terms of lengths. Doesn't it then feel wrong to say that the empirical Pythagorean theorem is a consequence of the Pythagorean theorem (in $V$) under the above interpretation?
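The model-internal half of that routine is trivially checkable; here is a toy verification of the inner-product Pythagorean theorem in $\mathbb{R}^2$ (Step 2 only: the code says nothing whatsoever about triangles on paper).

```python
def inner(a, b):
    """Standard inner product on R^2."""
    return a[0] * b[0] + a[1] * b[1]

def norm_sq(a):
    """Squared induced norm ||a||^2 = <a, a>."""
    return inner(a, a)

a, b = (3.0, 0.0), (0.0, 4.0)   # orthogonal legs
c = (a[0] + b[0], a[1] + b[1])  # hypotenuse vector a + b

assert inner(a, b) == 0                       # <a, b> = 0
assert norm_sq(a) + norm_sq(b) == norm_sq(c)  # 9 + 16 == 25
print(norm_sq(c))
```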

There is an empirical result often called the empirical law of large numbers or the stability of frequencies, which states, for example, that the relative frequencies of heads in a long sequence of coin tosses 'converge' to some value $p$. In my opinion it is this empirical law which Kolmogorov refers to in the excerpt cited by OP. Afterwards OP argues that since we are using stability of frequencies to interpret probabilities and thereby the law of large numbers (LLN), it feels wrong to say that LLN 'guarantees' stability of frequencies.

I agree that it seems unnecessary to say that LLN is responsible for stability of frequencies whenever stability of frequencies is used to provide an interpretation.


However, stability of frequencies only looks good on paper. Since we cannot make infinitely many observations, this empirical law isn't of much help for determining probabilities in practice. On top of that many problems of practical interest are not as reproducible as a coin toss. I am not a probabilist or statistician, so from here on I have to rely on the opinion of other people.

First of all, let me quote Mark Kac. OP has provided a short excerpt in [4]. Here is how Kac continues.

The applicability of such a theory [probability theory] to natural sciences must ultimately be tested by an experiment. But this is true of all mathematical theories when applied outside the realm of mathematics, and the vague feeling of discomfort one encounters (mostly among philosophers!) when first subjected to statistical reasoning must be attributed to the relative novelty of the ideas.

To me there is no methodological distinction between the applicability of differential equations to astronomy and of probability theory to thermodynamics or quantum mechanics.

It works! And brutally pragmatic as this point of view is, no better substitute has been found. ([2], p.5)

What Kac suggests is a more pragmatic point of view, that of a physicist. Let me quote Krzysztof Burdzy.

[Compared with mathematicians,] physicists have a different idea of a 'proof' – you start with a large number of unrelated assumptions, you combine them into a single prediction, and you check if the prediction agrees with the observed data. If the agreement is within 20%, you call the assumptions proved. ([3], p.41)

I think that Burdzy has made up the figure of 20%, but I am not a physicist. More importantly, we can apply probability theory (including LLN) with any assumption that we deem fit, which makes stability of frequencies in some sense obsolete. As long as we can produce predictions that can be tested, we don't have to worry about the 'vague' link between the model and the real world. Over time and by doing a lot of experiments, we acquire a certain confidence in our claims (if they agree with the observations) and then they become accepted by the math/science community.

All of this is rather difficult to comprehend for a beginner in probability/statistics. In a first probability course, the statistical tools that are needed for the predictions only enter very late or not at all, which may be the reason why students don't see this 'pragmatic' approach to applied probability/statistics. On the other hand, stability of frequencies may still be useful for gaining intuition.

A layman (like myself) gets lost easily in the big philosophical frequentist vs. Bayesian debate(s). As a mathematician, I can accept that the only definition of probability appears in the Kolmogorov axioms and I don't need to know its 'true' meaning in order to learn and apply the theory. My goal in writing this was to provide some consolation for a specific group of people (including myself), i.e. those who have gone through a similar thought process as OP.


[1] Torsten Schoeneberg (Aug 19, 2021)

[2] Mark Kac, "Probability and related topics in physical sciences"

[3] Krzysztof Burdzy, "The search for certainty: on the clash of science and philosophy of probability" (suggested by Bjørn Kjos-Hanssen in [4])

[4] Logical issues with the weak law of large numbers and its interpretation

[5] Is probability and the Law of Large Numbers a huge circular argument?