Probability Theory

Probability Theory#

This book walks through essential concepts of probability theory with practical examples and aims to make probability theory graspable for roboticists and machine learning engineers.

At the end of this book, you will understand …

Basic terms of probability theory
Random events and how they interact
Quantities of interest in probabilistic modeling
Learning probabilistic models from data
Applying probabilistic models to practical problems

Concepts of Probability#

The first section walks through the essential concepts of probability that lay the foundation to understand modern probabilistic modelling. You should get familiar with the sigma-algebra first. For that, I recommend this book.

The reason why the sigma-algebra is the set of interest for probability theory is, bluntly speaking, knowing the probability of every atomic event is knowing the probability of every possible event. Practically, the sigma-algebra used in real world modeling is always the powerset of some elementary events. While it would be possible to construct a sigma-algebra that is not the powerset of its elementary events, it serves up to everything I have seen so far, no purpose to the real world.

The sigma algebra of interest for practical probabilistic reasoning is the product sigma-algebra. Next, we define the probabilities of events in Definition 1.

Probability Measure#

Definition 1 (Probability Measure)

Let $(E, \Im)$ be a $\sigma$-algebra. A non-negative real function $P \rightarrow \mathbb{R}_{0, +}$ is called a measure if it satisfies the following properties:

$P(\emptyset) = 0$
For any countable sequence $\{A_i \in \Im \}_{i=1,...,}$ of pairwise disjoints sets $A_i \cap A_j = \emptyset$ if $i \neq j, P$ satisfies countable additivity ($\sigma$-additivity):

\[P \left( \bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty P(A_i)\]
$P(A \cup B) = P(A) + P(B) - P(A,B)$
$P(E) = 1$

The triple $(E, \Im, P)$ is called a probability space.

The probability measure just tells that for non-intersecting sets, you can determine the probability of the union by adding the atomic probabilities. Furthermore, for intersecting sets you have to subtract the intersection because it is added in there twice otherwise. You may realise that in the random events package the composite random event always contains a disjoint union of events. The sigma-additivity is the sole reason for that.

A common way to visually think about those things is through venn diagrams. The size of the objects depicted in the diagrams is proportionate to the probability of the respective event.

from random_events.set import SetElement, Set
from random_events.variable import Symbolic, Continuous, Integer
from random_events.product_algebra import Event, SimpleEvent
from random_events.interval import *
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import itertools

x = Continuous("x")
y = Continuous("y")

event_a = SimpleEvent({x: closed(0, 2), y: closed(0, 2)}).as_composite_set()
event_b = SimpleEvent({x: closed(2.5, 3), y: closed(1, 2.5)}).as_composite_set()
fig = go.Figure(event_a.plot(color="red") + event_b.plot(color="blue"), event_a.plotly_layout())
fig.update_layout(title="Two disjoint events")
fig.show()

In this venn diagram, we can see that event $A$ and event $B$ are not intersecting. Hence, the probability (size) of both events can be added to get the size of their union (sigma-additivity).

event_a.simple_sets[0][x] = closed(2, 2.75)
fig = go.Figure(event_a.plot(color="red") + event_b.plot(color="blue"), event_a.plotly_layout())
fig.update_layout(title="Two non-disjoint events")
fig.show()

In the second venn diagram, we see that the events are intersecting. Hence, the size of their union is smaller by the sum of both sizes by exactly the intersecting part (3. Axiom).

In general, thinking of probabilities as size of sets is a good intuition. Most things you can think about for geometrical sizes of shapes also apply to probability. However, probabilities never exceed unity (4. Axiom). From the definition of the probability measure, we can derive Theorem 1.

Sum Rule#

Theorem 1 (Sum Rule)

From $A + \neg A = E$ we get

\[P(A) + P(\neg A) = P(E) = 1 \text{, thus } P(A) = 1 - P(\neg A).\]

Furthermore, from $A = A \cap (B + \neg B)$, using the notation $P(A, B) = P(A \cap B)$ for the joint probability of $A$ and $B$, we get the Sum Rule

\[P(A) = P(A,B) + P(A, \neg B).\]

The sum rule is a way to express a belief over only the event $A$ if said probability is not known. Instead, the joint probability of $A$ and $B$ is known, and by adding (marginalizing) over all cases of $B$ we can get the probability of $A$.

fig = make_subplots(rows=1, cols=2, subplot_titles=["Event A and B", "Event A and not B"])

event_a_and_b = event_a & event_b
event_a_and_not_b = event_a & ~event_b
fig.add_traces(event_a_and_b.plot(), rows=1, cols=2)
fig.add_traces(event_a_and_not_b.plot(), rows=1, cols=1)
fig.update_xaxes(range=[2, 2.75], )
fig.update_yaxes(range=[0, 2], )
fig.update_layout(title="Visualization of the Sum Rule")
fig.show()

In the interactive plot (play around with it as you want) you can observe that the event $A$ is precisely described by the union of both events $A, \neg B$ and $A, B$. Furthermore, both events are disjoint and the probability can be calculated using sigma$additivity.

Before we dive into the next block of definitions, there is a little preface. As these next theorems are so strongly intertwined with each other, there is not much sense of making a plot for each of them. There is an interactive plot after Bayes theorem that gives light to how these definitions work. Onward, to Definition 2.

Conditional Probability#

Definition 2 (Conditional Probability)

If $P(A) > 0$, the quotient

\[P(B|A) = \frac{P(A, B)}{P(A)}\]

is called the truncated probability of $B$ given $A$. It immediately gives the product rule

\[P(A,B) = P(B|A)P(A) = P(A|B)P(B).\]

It is easy to show that $P(B|A) \geq 0, P(E|A) = 1$ and for $B\cap C = \emptyset$, we have $P(B+C |A) = P(B|A) + P(C|A)$.

Thus, for a fixed $A$, $(E, \Im, P(\cdot|A))$ is a probability space as well.

Instead of explaining this concept in my own words, I will refer to Todd Kemps phrasing about it:

We often think of truncated probability intuitively in terms of a two-stage experiment. In the first stage of the experiment, we see whether one particular event $A$ has occurred or not. If it has, it may influence whether a second event $B$, which we’re going to measure in a second stage, will occur or not. Therefore, we may want to update our information on probabilities of the later events given information about whether the earlier event has occurred or not. In this language we’ll refer to the original probability measure $P(B)$ as the prior probability of an event $B$, and after we observe that event $A$ has occurred, we refer to the new updated truncated probability as the posterior probability of that same event $B$. In more practical applications $A$ is referenced as the evidence since it is an event that evidently happened and $B$ is the query, since it is the event of interest. There are two very elementary but extraordinarily important results that come from that line of thinking the so-called law of total probability and most critically, Bayes theorem.

Todd Kemp, 30.1 Conditional Probability

Law of total probability#

Theorem 2 (Law of total probability)

Let $A_1 + A_2 + \cdots + A_n = E$ and $A_i \cap A_j = \emptyset$ if $i \neq j$.

Then for any $X \in \Im,$

\[P(X) = \sum_{i=1}^n P(X|A_i)P(A_i).\]

Because $X = E \cap X = \bigcup_{i=1}^n (A_i \cap X),$ we get that

\[P(X) = \sum_{i=1}^n P(A_i, X) = \sum_{i=1}^n P(X|A_i) P(A_i).\]

Bayes Theorem#

Theorem 3 (Bayes Theorem)

Let $A_1 + A_2 + \cdots + A_n = E$ and $A_i \cap A_j = \emptyset$ if $i \neq j$.

Then for any $X \in \Im,$

\[P(A_i|X) = \frac{P(X|A_i)P(A_i)}{\sum_{j=1}^n P(X|A_j) P(A_j)}.\]

In the interactive plot, you can see the event $X$ and the disjoint events $A_1, A_2, A_3, A_4$. The intersection of $X$ and $A_i$ is also plotted. You can see that the sum of the intersections is equal to the event $X$. This is the law of total probability. Conditioning on an event can be seen as zooming into the events and only accounting for what is inside the zoomed area. The Bayes theorem is a way to reverse the conditioning. It is a way to update the prior belief of an event given new evidence and just a consequence of the definitions given so far.

Random Variables#

While there exists a definition for random variables, I find it quite confusing and disconnected to how things are approached in practice. Instead, I will give a more practical and engineering motivated definition on what a random variable is.

Definition 3 (Random Variable)

A random variable is a term in a language that can take different values. The range of values of a random variable, $\text{dom}(X)$, is exhaustive and mutually exclusive.

A tuple $X$ of random variables with $X=\left<X_{1}, \ldots, X_{n}\right>$ is a complex random variable with the domain

\[\text{dom}(X)=\text{dom}(X_{1}) \times \ldots \times \text{dom}(X_{n}).\]

A possible world $x=\left<X_1=v_1,\ldots,X_n=v_n\right>$ specifies a value assignment for each random variable $X_i$ under consideration.

The set $\mathcal{X}$ of all possible worlds is called the universe.

Let’s look at a practical example. In real world scenarios, data is most of the time provided as a table, dataframe, array, etc. The rows of such an object are referred to as samples or instances and the columns as variables or features.

from sklearn.datasets import load_iris

data = load_iris(as_frame=True)

dataframe = data.data
target = data.target.astype(str)
target[target == "0"] = "Setosa"
target[target == "1"] = "Versicolour"
target[target == "2"] = "Virginica"
dataframe["plant"] = target
dataframe

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	plant
0	5.1	3.5	1.4	0.2	Setosa
1	4.9	3.0	1.4	0.2	Setosa
2	4.7	3.2	1.3	0.2	Setosa
3	4.6	3.1	1.5	0.2	Setosa
4	5.0	3.6	1.4	0.2	Setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	Virginica
146	6.3	2.5	5.0	1.9	Virginica
147	6.5	3.0	5.2	2.0	Virginica
148	6.2	3.4	5.4	2.3	Virginica
149	5.9	3.0	5.1	1.8	Virginica

150 rows × 5 columns

In this example, we see a dataset with 150 instances and 5 variables. Probability distributions can now be defined over events that are constructed over the complex random variable of this dataset.

Independence#

Definition 4 (Independence)

We say that $A$ is independent of $B$ is $(PA|B) = P(A)$ or equivalently that

\[P(A \cap B) = P(A) P(B).\]

Notation: $A \perp \!\!\! \perp B$. Information about $B$ does not give information about $A$ and vice versa.

In Definition 4 $A$ and $B$ could be both events or event entire random variables. Let’s look at a practical example where we construct a joint probability distribution over colors and shapes.

from probabilistic_model.distributions.multinomial import MultinomialDistribution
import numpy as np
from enum import IntEnum
from random_events.set import Set

class Color(IntEnum):
    BLUE = 0
    RED = 1
    

class Shape(IntEnum):
    CIRCLE = 0
    RECTANGLE = 1
    TRIANGLE = 2


color = Symbolic("color", Set.from_iterable(Color))
shape = Symbolic("shape", Set.from_iterable(Shape))

probabilities = np.array([[2 / 15, 1 / 15, 1 / 5],
                          [1 / 5, 1 / 10, 3 / 10]])

distribution = MultinomialDistribution((color, shape), probabilities)
color_event = SimpleEvent({color: Color.BLUE}).as_composite_set()
shape_event = SimpleEvent({shape: (Shape.CIRCLE, Shape.TRIANGLE)}).as_composite_set()
joint_event = color_event & shape_event
print(f"P({joint_event}) = {distribution.probability(joint_event)}")
print(
    f"P({color_event}) * P({shape_event}) = {distribution.probability(color_event) * distribution.probability(shape_event)}")

P({
    color ∈ 0
}) = 0.4
P({
    color ∈ 0
}) * P({
    shape ∈ 0 u 2
}) = 0.3333333333333334

We can see that both events are independent as their joint probability can be decomposed to the product of the marginal probabilities. However, in most applications one is interested in independence between entire variables (dimensions). Checking independence over entire dimensions requires either analysis of the computation or checking independence over all possible events. Let’s verify if the variables are independent.

for color_value, shape_value in itertools.product(color.domain.simple_sets, shape.domain.simple_sets):
        joint_event = SimpleEvent({color: color_value, shape: shape_value}).as_composite_set()
        color_event = SimpleEvent({color: color_value}).as_composite_set()
        shape_event = SimpleEvent({shape: shape_value}).as_composite_set()
        print(f"P({joint_event}) = {distribution.probability(joint_event)}")
        print(f"P({color_event}) * P(shape={shape_event}) = {distribution.probability(color_event) * distribution.probability(shape_event)}")
        print(np.allclose(distribution.probability(joint_event), 
                          distribution.probability(color_event) * distribution.probability(shape_event)))
        print("-" * 80)

P({
    color ∈ 0,
    shape ∈ 0
}) = 0.13333333333333333
P({
    color ∈ 0
}) * P(shape={
    shape ∈ 0
}) = 0.13333333333333336
True
--------------------------------------------------------------------------------
P({
    color ∈ 0,
    shape ∈ 1
}) = 0.06666666666666667
P({
    color ∈ 0
}) * P(shape={
    shape ∈ 1
}) = 0.06666666666666668
True
--------------------------------------------------------------------------------
P({
    color ∈ 0,
    shape ∈ 2
}) = 0.2
P({
    color ∈ 0
}) * P(shape={
    shape ∈ 2
}) = 0.2
True
--------------------------------------------------------------------------------
P({
    color ∈ 1,
    shape ∈ 0
}) = 0.2
P({
    color ∈ 1
}) * P(shape={
    shape ∈ 0
}) = 0.20000000000000004
True
--------------------------------------------------------------------------------
P({
    color ∈ 1,
    shape ∈ 1
}) = 0.1
P({
    color ∈ 1
}) * P(shape={
    shape ∈ 1
}) = 0.10000000000000002
True
--------------------------------------------------------------------------------
P({
    color ∈ 1,
    shape ∈ 2
}) = 0.3
P({
    color ∈ 1
}) * P(shape={
    shape ∈ 2
}) = 0.30000000000000004
True
--------------------------------------------------------------------------------

As we can see, the entire variables are independent.

The definition of independence can be expanded to truncated independence. Formally,

Conditional Independence#

Definition 5 (Conditional Independence)

Two variables (events) $A$ and $B$ are conditionally independent given variable (event) $C$, if and only if their truncated distribution factorizes,

\[P(A,B|C) = P(A|C) P(B|C).\]

In that case we have $P(A|B, C) = P(A|C)$, i.e. in light of information $C$, $B$ provides no (further) information about $A$.

Notation: $A \perp \!\!\! \perp B \, | \, C$.

Let’s explore this in another example.

class Size(IntEnum):
    SMALL = 0
    LARGE = 1
    
size = Symbolic("size", Set.from_iterable(Size))

probabilities = np.array([[[2 / 30, 1 / 30, 1 / 10], [0, 0.3, 0.05]],
                          [[1 / 10, 1 / 20, 3 / 20], [ 0.15, 0, 0.,]]])
distribution = MultinomialDistribution((color, size, shape), probabilities)

small_event = SimpleEvent({size: Size.SMALL}).as_composite_set()
p_small = distribution.probability(small_event)

large_event = ~small_event
p_large = distribution.probability(large_event)

color_event = SimpleEvent({color: Color.BLUE}).as_composite_set()
color_and_small = color_event & small_event
color_and_large = color_event & large_event
p_color_and_small = distribution.probability(color_and_small)
p_color_and_large = distribution.probability(color_and_large)

shape_event = SimpleEvent({shape: Shape.CIRCLE}).as_composite_set()
shape_and_small = shape_event & small_event
shape_and_large = shape_event & large_event
p_shape_and_small = distribution.probability(shape_and_small)
p_shape_and_large = distribution.probability(shape_and_large)

shape_and_color_and_small = shape_event & color_event & small_event
p_shape_and_color_and_small = distribution.probability(shape_and_color_and_small)

shape_and_color_and_large = shape_event & color_event & large_event
p_shape_and_color_and_large = distribution.probability(shape_and_color_and_large)

p_shape_given_small = p_shape_and_small / p_small
p_color_given_small = p_color_and_small / p_small
p_shape_and_color_given_small = p_shape_and_color_and_small / p_small

p_shape_given_large = p_shape_and_large / p_large
p_color_given_large = p_color_and_large / p_large
p_shape_and_color_given_large = p_shape_and_color_and_large / p_large

print(f"P(Shape|Small) = {p_shape_given_small}")
print(f"P(Color|Small) = {p_color_given_small}")
print(f"P(Shape, Color|Small) = {p_shape_and_color_given_small}")
print(np.allclose(p_shape_given_small * p_color_given_small, p_shape_and_color_given_small))

print(f"P(Shape|Large) = {p_shape_given_large}")
print(f"P(Color|Large) = {p_color_given_large}")
print(f"P(Shape, Color|Large) = {p_shape_and_color_given_large}")
print(np.allclose(p_shape_given_large * p_color_given_large, p_shape_and_color_given_large))

P(Shape|Small) = 0.6333333333333333
P(Color|Small) = 1.1
P(Shape, Color|Small) = 0.6333333333333333
False
P(Shape|Large) = 0.6333333333333333
P(Color|Large) = 1.1
P(Shape, Color|Large) = 0.6333333333333333
False

In this example, we can observe that the variables are conditionally independent if the size of the object is small. They are conditionally dependent if the size of the object is large. In contrast to the previous example, the variables are not independent over the entire domain, only certain configurations (events) of the variables are independent.

Continuous Random Variables#

We just saw how to construct a joint probability table by enumerating all elementary events. However, regarding continuous spaces, this is not possible since one cannot enumerate all real numbers.

For real spaces, the probability density function is used. Firstly, though, we have to understand events in continuous spaces. In continuous spaces, we are interested in events that describe intervals on the real line. For instance, $P(0 \leq x \leq 1) $ is the probability that $x$ is between $0$ and $1$. Giving a formal description of the Borel algebra requires topological arguments, which I will not discuss. For most applications, it is enough to understand that we want to reason about intervals.

Probability Density Function (PDF)#

Definition 6 (Probability Density Function (PDF))

Let $\mathfrak{B}$ be the Borel sigma-algebra on $\mathbb{R}^d$. A probability measure $P$ on $(\mathbb{R}^d, \mathfrak{B})$ has a density $p$ if $p$ is a non-negative (Borel) measurable function on $\mathbb{R^d}$ satisfying for all $B \in \mathfrak{B}$

\[P(B) = \int_B p(x) dx =: \int_B p(x_1,...,x_d) dx_1 ... dx_d.\]

Bluntly speaking, the pdf is an always positive function that is high where we think that this region is highly likely and low where we think it’s not.

The integral over a pdf is called the cumulative distribution function and is the object we use to calculate the probability of an event. Formally,

Cumulative Distribution Function (CDF)#

Definition 7 (Cumulative Distribution Function (CDF))

For probability measures $P$ on $(R^d, \mathfrak{B})$, the cumulative distribution function is the function

\[F(x) = P\left( \prod_{i=1}^d (X_i < x_i) \right).\]

If $F$ is sufficiently differentiable, then $P$ has a density given by

\[p(x) = \frac{\partial^dF}{\partial x_1, ..., \partial x_d}.\]

In this sense, the cdf over many variables describes the probability of the event that every variable is below some value (vector). For example, if we evaluate $F \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ we calculate the probability of the event that $x_1 < 1 \wedge x_2 < 2$.

from probabilistic_model.distributions.gaussian import GaussianDistribution
normal = GaussianDistribution(Continuous("x"), 0, 1)
fig = go.Figure(normal.plot(), normal.plotly_layout())
fig.for_each_trace(lambda trace: trace.update(visible='legendonly') if trace.name in ["Expectation", "Mode"] else ())

The figure above visualizes the objects we just discussed for a univariate normal distribution.

[CVVdB20]

Y Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: a unifying framework for tractable probabilistic models. UCLA. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf, 2020.

[GdCP+24]

Gennaro Gala, Cassio de Campos, Robert Peharz, Antonio Vergari, and Erik Quaeghebeur. Probabilistic integral circuits. In International Conference on Artificial Intelligence and Statistics, 2143–2151. PMLR, 2024.

[Hen20]

Philipp Hennig. Probabilistic machine learning. lecture course, University of Tuebingen, 2020. https://uni-tuebingen.de/en/180804.

[Jaw03]

James W. Jawitz. Moments of truncated continuous univariate distributions. Advances in Water Resources, 27:269–281, 2003.

[khttpsmathstackexchangecomusers255727kccu]

kccu (https://math.stackexchange.com/users/255727/kccu). How does one formally define sampling from a probability distribution? Mathematics Stack Exchange. URL:https://math.stackexchange.com/q/2336295 (version: 2017-06-26). URL: https://math.stackexchange.com/q/2336295, arXiv:https://math.stackexchange.com/q/2336295.

[LAB24]

Anji Liu, Kareem Ahmed, and Guy Van den Broeck. Scaling tractable probabilistic circuits: a systems perspective. arXiv preprint arXiv:2406.00766, 2024.

[Mar24]

Pedro Zuidberg Dos Martires. Probabilistic neural circuits. arXiv preprint arXiv:2403.06235, 2024.

[NPSB23]

Daniel Nyga, Mareike Picklum, Tom Schierenbeck, and Michael Beetz. Joint probability trees. arXiv preprint arXiv:2302.07167, 2023.

[Oga22]

Haruhiko Ogasawara. On moments of folded and truncated multivariate normal distributions. Communications In Statistics — Theory and Methods URL: https://www.tandfonline.com/doi/full/10.1080/03610926.2020.1867742, 51(19):6834 – 6862, 2022.

[PNR+21]

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research. URL: https://arxiv.org/abs/1912.02762, 22(1):2617–2680, 2021.

[PLV+20]

Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina, Martin Trapp, Guy Van den Broeck, Kristian Kersting, and Zoubin Ghahramani. Einsum networks: fast and scalable learning of tractable probabilistic circuits. In International Conference on Machine Learning, 7563–7574. PMLR, 2020.

[PSA19]

Moshe Pollak and Michal Shauly-Aharonov. A double recursion for calculating moments of the truncated normal distribution and its connection to change detection. Methodology and Computing in Applied Probability URL: https://link.springer.com/article/10.1007/s11009-018-9622-7, 21(53):889–906, 2019.

[US20]

UCLA-StarAI. Probabilistic circuits: representations, inference, learning and theory (tutorial at ecml-pkdd 2020). YouTube https://www.youtube.com/watch?v=2RAG5-L9R70, 2020.