Probability Theory: Probability space of a random vector


I'm having difficulty finding books/explanations of Probability Theory that formalise examples rigorously; most either treat examples informally, or stay so rigorous and theoretical that they offer little to no examples.

In the book "Pattern Recognition and Machine Learning" by Christopher Bishop, the following easy example to deal with conditional probabilities is presented:

Example

I have a red box (random variable $B = r$) chosen with 40% probability, and a blue box ($B = b$) chosen with 60% probability. Inside the red box there are 2 apples (random variable $F = a$) and 6 oranges ($F = o$); inside the blue box there are 3 apples and 1 orange. Once a box has been selected (red or blue), each fruit inside it is equally likely to be chosen. The author then uses this setup to explain Bayes' theorem (conditional probabilities).

I have the following questions to that example:

  1. I want to define very rigorously what the probability space looks like here, especially because I couldn't find anywhere what the probability space for a multivariate problem looks like.

I assume in this case it looks like this: Sample set: $\Omega = \{(B=r,F=a), (B=r,F=o), (B=b,F=a), (B=b,F=o)\}$, Event set: $\mathcal{F} = \mathfrak{P}(\Omega)$ (power set of $\Omega$), and $P: (B,F) \longrightarrow [0,1]$.

What bothers me here is, first, the notation of the events. As I understand it, $B$ and $F$ are already random variables, so why do they take anything other than numerical values? (By definition, a random variable can only take numerical values, as a function of a sample $\omega \in \Omega$.) Yet many books assign non-numerical values to random variables.

  2. The probability measure in this probability space is defined to take a vector as input, which means that writing something like $P(B=r)$ is, pedantically, incorrect; one would have to write $P(B=r) := P\big(\{(B=r,F=f) \mid f = a \lor f = o\}\big)$.

  3. Building on the last question, how is conditional probability defined rigorously in this probability space? What exactly is it, and how does it work?
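To make question 2 concrete, here is a sketch (my own, not from Bishop) of that marginalisation in code, using the joint probabilities implied by the example (box prior times within-box fruit frequencies):

```python
from fractions import Fraction

# Sketch: P(B=r) obtained by summing the joint measure over the second
# coordinate, i.e. P({(B=r, F=f) : f = a or f = o}).
joint = {
    ('r', 'a'): Fraction(2, 5) * Fraction(2, 8),  # red box, then an apple
    ('r', 'o'): Fraction(2, 5) * Fraction(6, 8),  # red box, then an orange
    ('b', 'a'): Fraction(3, 5) * Fraction(3, 4),  # blue box, then an apple
    ('b', 'o'): Fraction(3, 5) * Fraction(1, 4),  # blue box, then an orange
}

# Marginalize out the fruit coordinate to recover P(B=r).
p_B_r = sum(p for (box, fruit), p in joint.items() if box == 'r')
print(p_B_r)  # 2/5, i.e. the original 40% prior on the red box
```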

The reason I'm writing this is that it seems to me that in engineering literature the measure $P(\cdot)$ gets thrown around lightly as an intuitive term for "probability", but the moment one wants to do something more sophisticated it's hard to know exactly what one is doing, because the measure is never explicitly defined.

Greatly appreciated!

On BEST ANSWER

Remember that a probability space consists of three things, $(\Omega,\mathcal F,P)$: $\Omega$ is the sample space; $\mathcal F$ is the event space, which is a $\sigma$-algebra on $\Omega$; and $P$ is the probability measure, a function $P:\mathcal F\to[0,1]$ satisfying certain conditions (the probability axioms).

A random variable $X$ is a real-valued measurable function $X:\Omega\to\Bbb R$ that "translates" the sample space to numerical values, thus modelling the random experiment.

Once we have a random variable $X$, we can define a probability measure $\mathbb P$ on the Borel $\sigma$-algebra of $\Bbb R$, associated with the probability measure $P$ we had before. Given a Borel set $A\subset\Bbb R$, we define its probability as $\mathbb P(A)=P(X^{-1}(A))$. Sometimes we don't use $\mathbb P$ and, as an abuse of notation, we say the probability of $A$ is $P(X\in A)$. As a side note: if $\omega\in\Omega$ is an outcome, then the event where $\omega$ is the only outcome is $\{\omega\}$, so when we write $P(\omega)$ we really mean $P(\{\omega\})$, since $P$ acts on events.
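A minimal sketch of this pushforward construction, with a toy example of my own (a fair six-sided die and the parity of its outcome):

```python
from fractions import Fraction

# Toy example (not from the box/fruit problem): a fair die.
omega = [1, 2, 3, 4, 5, 6]                 # sample space
P = {w: Fraction(1, 6) for w in omega}     # P on each singleton event {w}

X = lambda w: w % 2                        # random variable: parity of outcome

def pushforward(A):
    """Pushforward measure: P_X(A) = P(X^{-1}(A)) for a (Borel) set A."""
    preimage = [w for w in omega if X(w) in A]   # X^{-1}(A) as an event
    return sum(P[w] for w in preimage)

print(pushforward({1}))  # 1/2 -- "P(X in {1})", the probability of an odd roll
```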

The $B$ and $F$ in your question are random variables, as you said, but there is no need for them. You could just define $\Omega_1=\{r,b\}$ and $\Omega_2=\{a,o\}$, each with its power set as the event space, and two probability measures $P_1$ and $P_2$. If you want to use random variables, you have to define them, each on its respective sample space: $B:\Omega_1\to\Bbb R$ and $F:\Omega_2\to\Bbb R$, so there would be two probability measures $\mathbb P_1$ and $\mathbb P_2$ associated with $P_1$ and $P_2$ respectively. So, if you decide to use $B$ and $F$, which numbers in $\Bbb R$ are $B(r),B(b),F(a),F(o)$? As I said, you would have to define them, but there is no need for that; you can just use $P_1$ and $P_2$.

And how do $P_1$ and $P_2$ work? Well, the information the example gives us has to be interpreted. The first part is easy: it's telling us how $P_1$ works (it literally gives the probability of choosing each box), so $P_1(r)=0.4,P_1(b)=0.6$. But the second part can't be interpreted as telling us how $P_2$ works, since $P_2$ acting on $\Omega_2$ depends on $P_1$ acting on $\Omega_1$, so we can't really define it yet (if we used the law of total probability to define it, we would be cheating, since that law needs the sample space $\Omega=\Omega_1\times\Omega_2$ to talk about intersections of events from different sample spaces' power sets, or, even more, to talk about conditional probability). The information in the second part tells us the probability of choosing each fruit given that we know which box we're choosing it from; but how could we interpret this? It looks like conditional probability, but, as I said, we need something more to handle it.

To interpret it correctly we first have to construct $\Omega=\{(r,a),(r,o),(b,a),(b,o)\}$ (note that $\Omega=\Omega_1\times\Omega_2$). We take its power set as the $\sigma$-algebra and a probability measure $P:\mathfrak P(\Omega)\to[0,1]$ acting on it (we can't say how it acts yet). By definition, given two events $U,V$ in the power set with $P(U)>0$, we have $P(V\mid U)=\dfrac{P(V\cap U)}{P(U)}$.
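That definition can be sketched in code. To have concrete numbers I plug in the joint values derived further down in this answer (0.1, 0.3, 0.45, 0.15), written as exact fractions, so treat them as given here:

```python
from fractions import Fraction

# Joint measure on Omega = Omega_1 x Omega_2, given as singleton probabilities.
P = {('r', 'a'): Fraction(1, 10), ('r', 'o'): Fraction(3, 10),
     ('b', 'a'): Fraction(9, 20), ('b', 'o'): Fraction(3, 20)}

def prob(event):
    """P acting on an event, i.e. a set of outcomes."""
    return sum(P[w] for w in event)

def cond(V, U):
    """Conditional probability P(V | U) = P(V ∩ U) / P(U)."""
    return prob(V & U) / prob(U)

V = {('r', 'a'), ('b', 'a')}          # "an apple was drawn"
U = {('r', 'a'), ('r', 'o')}          # "the red box was chosen"
print(cond(V, U))                     # 1/4, i.e. P(a | r) = 0.25
```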

Here is where we have to make an interpretation. For example, we could take $V=\{(r,a),(b,a)\}$ and $U=\{(r,a),(r,o)\}$, and interpret (in a very intuitive way) that $V$ is the same as the event $\{a\}$ in the power set of $\Omega_2$, and that $U$ is the same as the event $\{r\}$ in the power set of $\Omega_1$. In that case $P(V\mid U)$ really means the probability of choosing an apple, $\{a\}$, given that we're taking the fruit from the red box, $\{r\}$; so we can express that as $P(a\mid r)$. Moreover, since we interpret $U$ as $\{r\}$ in their respective $\sigma$-algebras, we know that $P(U)=P_1(r)$. The same goes for $P(a\mid b)$, $P(o\mid r)$, $P(o\mid b)$ and $P_1(b)$.

We know all these values, so we can calculate the values that $P$ gives: $P\big((r,a)\big)=P(U\cap V)=P(a\mid r)·P_1(r)=0.25·0.4=0.1$. Similarly, $P\big((r,o)\big)=0.75·0.4=0.3$, $P\big((b,a)\big)=0.75·0.6=0.45$ and $P\big((b,o)\big)=0.25·0.6=0.15$.
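A quick numerical check of these products (a sketch, using exact fractions to avoid rounding):

```python
from fractions import Fraction

# P((x, y)) = P(y | x) * P1(x): conditional fruit probability times box prior.
P1 = {'r': Fraction(2, 5), 'b': Fraction(3, 5)}                  # box prior
cond = {('a', 'r'): Fraction(2, 8), ('o', 'r'): Fraction(6, 8),  # fruit | box
        ('a', 'b'): Fraction(3, 4), ('o', 'b'): Fraction(1, 4)}

P = {(x, y): cond[(y, x)] * P1[x] for x in P1 for y in ('a', 'o')}
print(P[('r', 'a')])  # 1/10 = 0.1
print(P[('b', 'a')])  # 9/20 = 0.45
```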

Now we could use the law of total probability to get the values given by $P_2$, making the same kind of interpretation we made for events of the power set of $\Omega_1$, except now for events of the power set of $\Omega_2$. However, since we already know how $P$ works, we can just use the interpretation directly, together with the properties of a probability measure: $P_2(a)=P\big(\{(r,a),(b,a)\}\big)=P\big((r,a)\big)+P\big((b,a)\big)=0.55$; and then $P_2(o)=0.45$.
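A sketch of that marginalisation, summing the joint measure over the box coordinate:

```python
from fractions import Fraction

# The joint measure derived above, as singleton probabilities.
P = {('r', 'a'): Fraction(1, 10), ('r', 'o'): Fraction(3, 10),
     ('b', 'a'): Fraction(9, 20), ('b', 'o'): Fraction(3, 20)}

# P2(y) = P({(r, y), (b, y)}): sum over the first (box) coordinate.
P2 = {y: sum(p for (x, fruit), p in P.items() if fruit == y)
      for y in ('a', 'o')}
print(P2['a'])  # 11/20 = 0.55
print(P2['o'])  # 9/20  = 0.45
```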

To conclude, if you wanted to "translate"/model $\Omega$ into $\Bbb R^2$ you would need a random vector $Z:\Omega\to\mathbb R^2$, related to $B$ and $F$ by $Z\big((x,y)\big)=\big(B(x),F(y)\big)\in\Bbb R^2$, and you would have a probability measure $\Bbb P$ on the Borel sets of $\Bbb R^2$ associated with $P$ (which in this case is not the product of $P_1$ and $P_2$, since $B$ and $F$ are not independent).
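And a sketch checking that closing remark: the joint measure differs from the product of the marginals, so $B$ and $F$ are indeed not independent.

```python
from fractions import Fraction

# Joint measure and marginals derived above.
P  = {('r', 'a'): Fraction(1, 10), ('r', 'o'): Fraction(3, 10),
      ('b', 'a'): Fraction(9, 20), ('b', 'o'): Fraction(3, 20)}
P1 = {'r': Fraction(2, 5), 'b': Fraction(3, 5)}
P2 = {'a': Fraction(11, 20), 'o': Fraction(9, 20)}

# If P were the product measure P1 x P2, these two numbers would match.
print(P[('r', 'a')])        # 1/10
print(P1['r'] * P2['a'])    # 11/50 -- not equal, hence not independent
```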