What is the formal definition of "probability distribution"?


Can someone please provide a useful reference on the definition of "probability distribution"?

A very popular site (top of Google search) states:

A probability distribution is a table or an equation that links each outcome of a statistical experiment with its probability of occurrence.

https://stattrek.com/probability-distributions/probability-distribution.aspx

I feel that this definition is very unsatisfactory. I need a better one with a reference.

Thank you!

6 Answers

Best Answer

To formally introduce the definition of probability distribution one has to have an appropriate notion of probability. Based on the axioms of Probability laid down by Kolmogorov, let's start with a probability space $(\Omega,\mathscr{F},\mu)$ where

  1. $\Omega$ is some non-empty set (sample space),
  2. $\mathscr{F}$ is a $\sigma$-algebra of subsets of $\Omega$ (measurable events),
  3. and $\mu$ is a positive, countably additive function on $\mathscr{F}$ with $\mu(\Omega)=1$.

Given another measurable space $(R,\mathscr{R})$, a random variable on $\Omega$ taking values in $R$ is a function $X:\Omega\rightarrow R$ such that $X^{-1}(A):=\{\omega\in\Omega: X(\omega)\in A\}\in\mathscr{F}$ for all $A\in\mathscr{R}$. $X$ is also said to be $(\Omega,\mathscr{F})$-$(R,\mathscr{R})$ measurable.

Definition 1. The distribution of $X$ (which we may denote as $\mu_X$) is defined as the measure on $(R,\mathscr{R})$ induced by $X$, that is $$\begin{align} \mu_X(A):=\mu\big(X^{-1}(A)\big), \quad A\in\mathscr{R}\tag{1}\label{one} \end{align} $$

Note, to address one of the concerns of the bounty sponsor: often in the literature (mathematical physics, probability theory, economics, etc.) the probability measure $\mu$ in the triplet $(\Omega,\mathscr{F},\mu)$ is also referred to as a probability distribution. This apparent ambiguity (there is no random variable to speak of) can be resolved by definition (1). To see this, consider the identity map $X:\Omega\rightarrow\Omega$, $\omega\mapsto\omega$. $X$ can be viewed as a random variable taking values in $(\Omega,\mathscr{F})$. Since $X^{-1}(A)=A$ for all $A\in\mathscr{F}$, $$\mu_X(A)=\mu(X^{-1}(A))=\mu(A),\quad\forall A\in\mathscr{F}$$


A few examples:

To fix ideas, consider $(\Omega,\mathscr{F},\mu)=((0,1),\mathscr{B}((0,1)),\lambda_1)$, the Steinhaus space; that is, $\Omega$ is the unit interval, $\mathscr{F}$ is the Borel $\sigma$-algebra on $(0,1)$, and $\mu$ is the Lebesgue measure $\lambda_1$.

  1. The identity map $X:(0,1)\rightarrow(0,1)$, $t\mapsto t$, considered as a random variable from $((0,1),\mathscr{B}(0,1))$ to $((0,1),\mathscr{B}(0,1))$, has the uniform distribution on $(0,1)$, that is, $\mu_X((a,b])=\lambda_1((a,b])=b-a$ for all $0\leq a<b<1$.

  2. The function $Y(t)=-\log(t)$, considered as a random variable from $((0,1),\mathscr{B}(0,1))$ to $(\mathbb{R},\mathscr{B}(\mathbb{R}))$ has the exponential distribution (with intensity $1$), i.e. $\mu_Y\big((0,x]\big)=1-e^{-x}$

  3. $Z(t)=\mathbb{1}_{(0,1/2)}(t)$, viewed as a random variable from $((0,1),\mathscr{B}(0,1))$ to $(\{0,1\},2^{\{0,1\}})$ has the Bernoulli distribution (with parameter $1/2$), that is $$ \mu_Z(\{0\})=\mu_Z(\{1\})=\frac12 $$

  4. Any $t\in(0,1)$ admits a unique binary expansion $t=\sum^\infty_{n=1}\frac{r_n(t)}{2^n}$ where $r_n(t)\in\{0,1\}$ and $\sum_nr_n(t)=\infty$. It can be shown that each map $X_n(t)=r_n(t)$ is a Bernoulli random variable (as in example 3). Furthermore, the distribution of $X:(0,1)\rightarrow\{0,1\}^\mathbb{N}$, as a random variable from $((0,1),\mathscr{B}(0,1))$ to the space of sequences of $0$'s and $1$'s, the latter equipped with the product $\sigma$-algebra (the $\sigma$-algebra generated by sets $\{\mathbf{x}\in\{0,1\}^\mathbb{N}:x(1)=r_1,\ldots,x(m)=r_m\}$, where $m\in\mathbb{N}$ and $r_1,\ldots,r_m\in\{0,1\}$), is such that $\{X_n:n\in\mathbb{N}\}$ becomes an independent, identically distributed (i.i.d.) sequence of Bernoulli (parameter $1/2$) random variables.
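Examples 2 and 4 can be checked numerically. The following is a minimal Monte Carlo sketch (the sample size, seed, and digit index are arbitrary choices of this illustration, not part of the answer above):

```python
import math
import random

random.seed(0)
N = 100_000
ts = [random.random() for _ in range(N)]  # draws from the Steinhaus space (0, 1)

# Example 2: Y(t) = -log(t) should have the Exponential(1) distribution,
# so the empirical CDF at x = 1 should approximate 1 - e^{-1}.
x = 1.0
empirical = sum(1 for t in ts if -math.log(t) <= x) / N

# Example 4: the n-th binary digit r_n(t) should be Bernoulli(1/2).
def r(n, t):
    """n-th digit of the binary expansion of t."""
    return int(t * 2**n) % 2

frac_ones = sum(r(3, t) for t in ts) / N  # should be close to 1/2
```

With $10^5$ samples both empirical quantities land within about $0.01$ of their theoretical values $1-e^{-1}\approx 0.632$ and $1/2$.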


Cumulative distribution function

In many applications of Probability, the random variables of interest take values on the real line $\mathbb{R}$. The real line has a natural measurable structure given by the $\sigma$-algebra $\mathscr{B}(\mathbb{R})$ generated by the open intervals in $\mathbb{R}$. This $\sigma$-algebra is known as the Borel $\sigma$-algebra.

  • It turns out that $X$ is a (real-valued) random variable if and only if $\{X\leq a\}:=X^{-1}((-\infty,a])\in\mathscr{F}$ for all $a\in\mathbb{R}$.

  • The distribution $\mu_X$ of $X$ can be encoded by the function $$F_X(x):=\mu_X((-\infty,x])=\mu(\{X\leq x\})$$

  • $F_X$ has the following properties: $\lim_{x\rightarrow-\infty}F_X(x)=0$, $F_X$ is monotone non-decreasing and right-continuous, and $\lim_{x\rightarrow\infty}F_X(x)=1$.

  • It turns out that any function $F$ that has the properties listed above gives rise to a probability measure $\nu$ on the real line. This is based on basic facts of measure theory, namely the Lebesgue-Stieltjes theorem.

  • For that reason, $F_X$ is commonly known as the cumulative distribution function of $X$, and very often it is simply referred to as the distribution function of $X$.
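The last two bullets can be illustrated constructively: given any function $F$ with the listed properties, $X(t)=\inf\{x: F(x)\geq t\}$ defined on the Steinhaus space $(0,1)$ has distribution function $F$. A minimal sketch, assuming the exponential CDF as the example and computing the generalized inverse by bisection (both choices are this illustration's, not the answer's):

```python
import math
import random

def F(x):
    """Exponential(1) CDF: 0 for x < 0, 1 - e^{-x} otherwise."""
    return 0.0 if x < 0 else 1.0 - math.exp(-x)

def generalized_inverse(t, lo=-1e6, hi=1e6, iters=100):
    """Bisection approximation of inf{x : F(x) >= t} for t in (0, 1)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if F(mid) >= t:
            hi = mid
        else:
            lo = mid
    return hi

random.seed(1)
sample = [generalized_inverse(random.random()) for _ in range(20_000)]
empirical = sum(1 for x in sample if x <= 1.0) / len(sample)
# empirical should be close to F(1) = 1 - 1/e
```

This is exactly the inverse-transform construction that underlies the Lebesgue-Stieltjes existence statement in the fourth bullet.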


Final Comments:

All these things are now discussed in courses on probability. At the basic level (by no means trivial; Feller, Introduction to Probability, Vol. I), people discuss mainly cumulative distribution functions of random variables; at the more advanced level (Feller, Introduction to Probability, Vol. II), people work with more general random variables, and so the "general" notion of distribution (as in $\eqref{one}$) is discussed.

Answer

To have a nice definition you need to have a nice object to define, so first of all, instead of speaking of a "probability distribution" it is better to refer, for example, to the

Cumulative Distribution Function -

The Cumulative Distribution Function, CDF (sometimes also called the Probability Distribution Function), of a random variable $X$, denoted by $F_X(x)$, is defined to be the function with domain the real line and counterdomain the interval $[0,1]$ which satisfies

$$F_X(x)=\mathbb{P}[X \leq x]=\mathbb{P}[\{\omega:X(\omega)\leq x\}]$$

for every real number $x$

A cumulative distribution function is uniquely defined for each random variable. If it is known, it can be used to find probabilities of events defined in terms of its corresponding random variable.

This definition is taken from: Mood, Graybill, and Boes, Introduction to the Theory of Statistics, McGraw-Hill.

Answer

Perhaps it might help to define what probability is first. The easiest way to think about it, if you don't want to get into measure-theoretic definitions, is that a probability is a number between $0$ and $1$, assigned to a logical statement, that represents how likely it is to be true. A logical statement can be something like, "It will rain tomorrow" or "A fair coin was tossed $10$ times and came up heads $5$ times." The statement itself can only be true or false, but you don't know for certain; the probability then tells you how likely it is to be true. Such logical statements are called events. A probability measure is a function $P$ defined on the set of all events in your universe and obeying consistency properties such as "if event $A$ implies event $B$, then $P\left(A\right) \leq P\left(B\right)$".

If an event is a logical statement whose truth or falsity you don't know, then a random variable is a number whose value you don't know. If $X$ is such an unknown number, then you can come up with events related to that number, such as "$X \leq x$" for different fixed values of $x$. Since a probability measure maps events into $\left[0,1\right]$, any such event has a probability. The probability distribution of $X$ is characterized by the function

$$F\left(x\right) = P\left(X \leq x\right)$$

defined on all $x\in\mathbb{R}$. This is called the "cumulative distribution function" or cdf. The cdf always exists for every random variable. The distribution can also be characterized using other objects that sometimes can be constructed from the cdf, but the cdf is the fundamental object that determines the distribution.

The above answer is not fully rigorous; in reality, events are defined to be subsets of a certain abstract "sample space" $\Omega$, and in order to define a probability measure, the set of events has to be "rich enough" (i.e., it has to be a sigma-algebra). A random variable is then a function $X:\Omega\rightarrow\mathbb{R}$. Nonetheless, even here you can still define events in terms of logical statements, e.g.,

$$\left\{X\leq x\right\} = \left\{\omega\in\Omega\,:\,X\left(\omega\right)\leq x\right\}$$

is one possible event. For the vast majority of modeling and computational problems that you may encounter in probability, you can solve them using the more intuitive notion of an event as a logical statement. It is quite rare that you actually need to dig into the sample space in detail. If I say that $X$ is normally distributed with mean $0$ and variance $1$, that fully characterizes the cdf of $X$ without really saying anything about $\Omega$ (I am implicitly assuming that some such $\Omega$ exists and $X$ is defined on it, but I don't know anything about the objects $\omega\in\Omega$).

Of course, for a deep understanding of the theory you will need to delve into the measure-theoretic foundation. If you want a good reference on measure-theoretic probability, I recommend "Probability and Stochastics" by Cinlar.

Answer

1: Formal definitions

To start with this question, one should define a probability space: A tuple of three items usually denoted $(\Omega,\mathcal{E},\Bbb{P})$ [or something of this nature].

$\Omega$ is the sample space - the set of all possible outcomes (not to be confused with events!) of our procedure, experiment, whatever. For instance, consider flipping a coin once: in this case, $\Omega=\{\text{H},\text{T}\}$. A random variable $X$ is the "result" of this experiment. You could define $X$ in this case as $$X=\begin{cases} 1 & \text{If coin lands heads}\\ 0 & \text{If coin lands tails} \end{cases}$$ Formally, one can define a measurement $M$ as a map $M:\Omega\to\mathcal{X}$ that assigns to each outcome of our experiment a value of the random variable. Here $\mathcal{X}$ is the set of all possible values of $X$. In this coin case, the "measurement" could be writing down a $0$ or $1$ in your notebook if you see a tails or heads accordingly. In this example $M$ happens to be bijective (one-to-one: no two outcomes share a measurement, and no two measurements come from the same outcome), but in general a random variable need not be injective: distinct outcomes may be assigned the same value.

$\mathcal{E}$ is the event space. For a finite sample space such as this one, it can be taken to be the set of all subsets (or powerset) of the sample space $\Omega$; in set notation, $\mathcal{E}=\mathcal{P}(\Omega)$. (For uncountable sample spaces one must instead restrict to a suitable $\sigma$-algebra of subsets.) In the coin case mentioned above, $\mathcal{E}=\{\varnothing,\{\text{H}\},\{\text{T}\},\{\text{H},\text{T}\}\}$.

$\mathbb{P}$ is a probability function or probability measure, which is a map or function that maps an event in the event space to a probability. Formally, $\mathbb{P}:\mathcal{E}\to[0,1].$ Besides being countably additive over disjoint events, $\Bbb{P}$ always satisfies three conditions:

1: $\Bbb{P}(e)\in[0,1]~\forall e\in\mathcal{E}$

2: $\Bbb{P}(\varnothing)=0.$

3: $\Bbb{P}(\Omega)=1.$

In words, 1: Every event has a probability. 2: Our experiment must have a result, or, the probability of nothing happening is $0$. 3: Something will happen, or, the probability of getting any result is $1$.
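The single coin flip described above is small enough to write out in full. A minimal sketch (the names `sample_space`, `event_space`, and `P` are this illustration's, not standard API):

```python
from itertools import combinations

# The sample space for one flip of a fair coin.
sample_space = frozenset({"H", "T"})

def powerset(s):
    """All subsets of s, as frozensets."""
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

event_space = powerset(sample_space)  # the 4 events for a single coin flip

def P(event):
    """Fair coin: each outcome carries probability 1/2."""
    return sum(0.5 for outcome in event)

# The three conditions from the text:
assert all(0 <= P(e) <= 1 for e in event_space)  # 1: every event has a probability
assert P(frozenset()) == 0                       # 2: P(empty set) = 0
assert P(sample_space) == 1                      # 3: P(sample space) = 1
```

Running the block verifies the three conditions directly on this four-element event space.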

2: Distributions

A probability distribution is a map or function $p$ that assigns a nonnegative number, not necessarily between $0$ and $1$, to every possible value of $X$. Formally, $p:\mathcal{X}\to\Bbb{R}_{\geq 0}$. In the discrete case, it is quite closely related to the probability measure mentioned before. Let $x\in\mathcal{X}$ be the result of a measurement of some possible outcome, say $x=M(\omega)$ for some $\omega\in\Omega$. It actually turns out that in the discrete case, $$p(x)=\Bbb{P}(\{\omega\}).$$ So one might ask: what is the difference between these two closely related things? Well, note that in the continuous case, the above equality does not hold. Since $\Omega$ is uncountably infinite, the probability of any single outcome, or indeed any countable subset of outcomes, is zero. That is, $$\mathbb{P}(\{\omega\})=0$$ regardless of the value of $p(x)$.

In the discrete case, $p$ must satisfy the condition $$\sum_{x\in\mathcal{X}}p(x)=1$$ And in the continuous case $$\int_{\mathcal{X}}p(x)\mathrm{d}x=1$$
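Both normalization conditions can be sanity-checked numerically. A minimal sketch, using the Bernoulli($1/2$) pmf for the discrete case and the Exponential($1$) density for the continuous one (both distributions, and the midpoint-rule integration, are assumptions of this example):

```python
import math

# Discrete case: a pmf must sum to 1.
pmf = {0: 0.5, 1: 0.5}
assert sum(pmf.values()) == 1.0

# Continuous case: a density must integrate to 1.
density = lambda x: math.exp(-x)  # Exponential(1) density on [0, infinity)
h = 0.001
# Midpoint rule on [0, 50]; the truncated tail e^{-50} is negligible.
total = sum(density((i + 0.5) * h) for i in range(int(50 / h))) * h
assert abs(total - 1.0) < 1e-6
```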

How can we interpret the value of $p(x)$? In the discrete case this is rather simple: $p(x)$ is the probability of measuring the value $x$ from our experiment. That is, $$p(x)=\mathbb{P}(X=x).$$

But in the continuous case, one must be more careful with how we interpret things. Consider two possible measurements $x_1$ and $x_2$. If $p(x_1)>p(x_2)$, then $\exists\delta>0$ such that $\forall\epsilon<\delta$ (with $\epsilon>0$), $$\Bbb{P}(X\in[x_1-\epsilon,x_1+\epsilon])>\Bbb{P}(X\in[x_2-\epsilon,x_2+\epsilon])$$ In simple terms, we are more likely to measure a value close to $x_1$ than close to $x_2$.
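The comparison above can be made concrete. A sketch using the standard normal density (the choice of density, the points $x_1=0$, $x_2=2$, and the midpoint-rule integration are all assumptions of this illustration):

```python
import math

def p(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def prob_near(x, eps, steps=1000):
    """P(X in [x - eps, x + eps]) by the midpoint rule."""
    h = 2 * eps / steps
    return sum(p(x - eps + (i + 0.5) * h) for i in range(steps)) * h

x1, x2 = 0.0, 2.0  # p(x1) > p(x2)
eps = 0.1
# Measurements near x1 are more likely than measurements near x2:
assert prob_near(x1, eps) > prob_near(x2, eps)
```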

I would recommend watching 3Blue1Brown's video about probability density functions.

Answer

The term "probability distribution" is ambiguous: it means two different things. One meaning is "probability measure", the precise definition of which is given in any modern probability textbook. The other is one particular way of uniquely specifying a probability measure on the real numbers $\mathbb R$, or on $\mathbb R^n$, namely, the "probability distribution function", a.k.a. "cumulative distribution function".

The intuition behind both is that they describe how "probability mass" is spread out over the space of possibilities. Given a probability measure $\mu$ on $\mathbb R$, one can recover its distribution function via $F(t)=\mu((-\infty,t])$; in fact, there is a theorem to the effect that given a probability distribution function $F$ there is a unique probability measure $\mu$ for which $F(t)=\mu((-\infty,t])$ holds for all $t$. So in a sense the distinction is not that important. Neither concept strictly speaking requires the concept of "random variable", by itself, even though their study is the main use of probability distributions.

This state of affairs, that there are two distinct but similar objects with similar names, arose about 100 years ago, as mathematicians were groping towards generalizations of the Lebesgue integral (such as the Lebesgue-Stieltjes integral) and so on. 150 years ago there were various discrete probability distributions (the Poisson, the binomial, etc.), and various continuous distributions with densities (the Gaussian, the Cauchy, etc.), and it was not clear that they were instances of the same sort of thing. The discovery of the Stieltjes integral was big news then, and more or less finished the measure theory of the real line: if you knew the probability distribution function, you knew (in principle) everything you needed to know about a real-valued random variable.

One attraction of the more abstract-seeming Kolmogorov version of probability theory was that it applied to such things as random functions, random sequences of events, and so on, not just random points in $\mathbb R^n$.

Answer

One reputable source which is commonly used as a textbook for undergraduates and graduates is Rick Durrett's "Probability: Theory and Examples", which is available as a free PDF at that link.

Many high-school and college level textbooks start by differentiating between "discrete" and "continuous" random variables, and define "probability mass functions" and "probability density functions" specific to these random variables. As @mathematicsstudent1122 requests, Durrett instead defines a "probability distribution" not in terms of a random variable, but a sample space.

Per Durrett, a "probability distribution" on a sample space $\Omega$ is a measure $P$ on $\Omega$ with the property that $P(\Omega) = 1$. "Events" are then just the measurable subsets of $\Omega$, and the "probability of an event" $E \subseteq \Omega$ is just its measure $P(E)$. If $\mathcal{S}$ is some other measure space, an $\mathcal{S}$-valued "random variable" $X$ on $\Omega$ is then a measurable function $X: \Omega \to \mathcal{S}$.

The first chapter of Durrett's text is devoted to building up the standard relevant machinery of measure theory ($\sigma$-algebras, integration, and so forth). He offers an admirably lucid and concise encapsulation of what differentiates "probability theory" from "measure theory on a space of total measure $1$" at the start of Chapter 2:

"Measure theory ends and probability begins with the definition of independence."

The rest of the text lives up to that level of elegance and insight, and Durrett also offers thought-provoking exercises, including a resolution of the infamous St. Petersburg Paradox (on page 65). Durrett's presentation can be jarringly flippant at times, as exemplified by the following exercise on the Poisson process:

[screenshot of the exercise omitted]

but especially in terms of free resources, you can't do better than Durrett as an introduction to the subject.

Remark: This gives the common definition of a "probability distribution" from the perspective of a working mathematician. Philosophically speaking, what one actually means by a "probability distribution" in everyday life may not exactly correspond to the mathematical formalisms. The Stanford Encyclopedia of Philosophy has an excellent overview of different interpretations of probability, not all of which are equivalent to the standard Kolmogorov axiomatization (which is the basis of Durrett's treatment of the subject, as well as any other textbook on standard probability theory).