Probabilistic Circuits

Probabilistic Circuits#

Unifying Probabilistic Modeling: There are various formalisms used for tractable probabilistic models. Probabilistic circuits offer a general and unified framework that can encompass these diverse approaches. This fosters a more cohesive understanding and facilitates advancements in the field. The main resource for this section is [CVVdB20].

Units#

Probabilistic Circuits describe computational graphs that can be used to represent probability distributions. The computational graphs are used to analyze the cost of answering queries.

Formally,

Definition 15 (Probabilistic Circuit)

A probabilistic circuit (PC) defined over variables $X$ is represented by a parameterized Directed Acyclic Graph (DAG) with a single root node $n_r$. Every leaf node in the DAG represents an input node that defines a primitive distribution over some variable $x \in X$. Every inner node $n$ is either a sum node or a product node, which merges the distributions encoded by its children, denoted $ch(n)$, to construct more complex distributions. The distribution represented by every node is defined recursively as:

\[\begin{align*} p_n(x) := \begin{cases} f_n(x) & \text{n is an input node,}\\ \prod_{c \in ch(n))} p_c(x) & \text{n is a product node }\\ \sum_{c \in ch(n))} \theta_{c, n} \cdot p_c(x) & \text{n is a sum node.} \end{cases} \end{align*}\]

where $f_n(x)$ is an univariate input distribution (e.g., Gaussian, Categorical), and $\theta_{n,c}$ denotes the parameter corresponding to edge $(n, c)$. Intuitively, sum nodes model mixtures of their input distributions, which require the mixture log_weights to be in the probability simplex: $\sum_{c \in ch(n)} \theta_{n,c} = 1$ and $\forall c \in ch(n), \theta_{n,c} \geq 0$. And product nodes build factorized distributions over their inputs. The size of a PC, denoted $|p|$, is the number of edges in its DAG. [LAB24]

Properties#

The cost of calculating the quantities described in Queries can now be described with respect to properties of the units in the circuit. While [CVVdB20] describes and proofs these properties in detail, I will give a brief overview here.

Decomposability#

Whenever a query requires an integral decomposability is the most important property. Before we look at this property, we have to define the scope of a unit.

Definition 16 (Scope)

Let $C = (G, \theta)$ be a PC with variables $X$. The computational graph $G$ is equipped with a scope function $\phi$ which associates to each unit $n \in G$ a subset of $X$, i.e., $\phi(n) \subseteq X$. For each non-input unit $$ n \in G, \phi(n) = \bigcup_{c \in ch(n)} \phi(c) $$. The scope of the root of $C$ is $X$. [CVVdB20]

Intuitively, the scope of a unit is all variables that are involved in the computation of the unit.

Definition 17 (Decomposability)

A product node $n$ is decomposable if the scopes of its input units do not share variables: $$ \phi(c_i) \cap \phi(c_j) = \emptyset, \forall c_i, c_j \in ch(n), i \neq j. $$

A PC is decomposable if all of its product units are decomposable. [CVVdB20]

Next to decomposability being very important, it is also very restrictive and forbids the use of something like a linear layer.

Decomposability is necessary the factor rule

\[ \int \int p_1(x) \cdot p_2(y) dx dy = \int p_1(x) dx \cdot \int p_2(y) dy \]

only applies if the functions do not share variables.

If they do, a haunting formula from your calculus class will come back to you

\[ \int u(x) \cdot v(x) dx = \int u(x) dx \cdot \int v(x) dx + \int u'(x) \cdot v'(x) dx. \]

Using the $uv$ substitution is often intractable and requires the use of numerical methods or unbounded many nested substitutions. You perhaps remember the problems in figuring out the correct substitution and the resulting mess of terms from your studies of calculus.
Hence, decomposability is necessary for the queries Marginal Distributions, Conditionals and Moments.

Smoothness#

The next important but less severe property is smoothness.

Definition 18 (Smoothness)

A sum node $n$ is smooth if its inputs all have identical scopes:

\[ \phi(c) = \phi(n), \forall c \in ch(n). \]

A circuit is smooth if all of its sum units are smooth. [CVVdB20]

We usually want smoothness to ensure a proper integral. Long story short if we have a non-smooth sum unit, we would end up with an integral that looks like

\[\int \int p(x) dx dy\]

and hence, it has no useful probabilistic interpretation.

Smoothness is less severe than decomposability since it can be easily ensured.

Smoothness is also a property needed for integration and hence required for the queries Marginal Distributions, Conditionals and Moments.

Determinism#

The last interesting property is determinism. Before we can define determinism, we have to define the support of a distribution.

Definition 19 (Support)

The support of a distribution is the set of all elementary events that have a non-zero probability of occurring

\[ supp(p) = \{x \in X | p(x) > 0\}. \]

Definition 20 (Support as Random Event)

The support of a smooth and decomposable circuit is an element of the product algebra with

\[ P(supp(p)) = 1. \]

The support of a distribution can also be seen as the region of influence of a function. Now, with determinism, we can argue on Mode query calculations

Definition 21 (Determinism)

A sum node $n$ is deterministic if, for any fully-instantiated input, the output of at most one of its children is nonzero

\supp(c_i) \cap \supp(c_j) = \emptyset, \forall c_i, c_j \in ch(n), i \neq j.

A circuit is deterministic if all of its sum nodes are deterministic. (Adapted from [CVVdB20])

Determinism intuitively ensures that the mode of the distribution can be calculated by ensuring that the regions of influence of every part of the circuit are disjoint. If they are disjoint (the circuit is deterministic), we can just get the mode by replacing every sum node with a max operation w.r.t. the likelihood.

The figure below shows a univariate Gaussian mixture model where we take the mode of the children as the mode of the distribution. We can easily see that this is not the real mode, since the regions of influence overlap. In general, whenever having distributions with infinite support in a mixture (sum unit), the mode is not obtained by replacing the sum node with a max operator.

x = Continuous("X")

model = ProbabilisticCircuit()
s1 = SumUnit(probabilistic_circuit = model)
s1.add_subcircuit(leaf(GaussianDistribution(x, 0, 0.5), model), np.log(0.1))
s1.add_subcircuit(leaf(GaussianDistribution(x, 1, 2), model), np.log(0.9))

wrong_mode, wrong_max_likelihood = model.root.subcircuits[1].distribution.mode()
wrong_max_likelihood = model.likelihood(np.array([[wrong_mode.simple_sets[0][x].simple_sets[0].lower]]))[0]
mode_trace = model.univariate_mode_traces(wrong_mode, wrong_max_likelihood)

wrong_mode, wrong_max_likelihood = model.root.subcircuits[0].distribution.mode()
wrong_max_likelihood = model.likelihood(np.array([[wrong_mode.simple_sets[0][x].simple_sets[0].lower]]))[0]
mode_trace += model.univariate_mode_traces(wrong_mode, wrong_max_likelihood)

fig = go.Figure(model.plot(), model.plotly_layout())
fig.add_traces(mode_trace)
fig.show()

The next figure shows that if we truncated the children of the sum node to a disjoint support, we get the correct mode.

model = ProbabilisticCircuit()
s1 = SumUnit(probabilistic_circuit = model)
s1.add_subcircuit(leaf(TruncatedGaussianDistribution(x, open_closed(-np.inf, 0.5).simple_sets[0], 0, 0.5), model), np.log(0.1))
s1.add_subcircuit(leaf(TruncatedGaussianDistribution(x, open(0.5, np.inf).simple_sets[0], 1, 2), model), np.log(0.9))

fig = go.Figure(model.plot(), model.plotly_layout())
fig.show()

Limitations#

While circuits are a great and wonderful tool to represent complex probability distributions, they have their limitations. One of the limitations comes from the sum units. Sum Units can be interpreted as latent variables.

Theorem 6 (Latent Variable Interpretation)

A sum unit $n$ can be interpreted as a latent symbolic variable. The domain of the variable are the children of the sum unit. The probabilities of the sum unit are given by

\[ p(c) = \theta_{c, n}. \]

Since sum units are convex, i. e. $\sum_{c \in ch(n)} \theta_{n,c} = 1$ and $\forall c \in ch(n), \theta_{n,c} \geq 0$, the sum unit can be interpreted as a latent variable. When performing inference, the latent variables are marginalized.

In Theorem 6, we see that the sum unit can be interpreted as a discrete latent variable. Whenever we would use a sum unit with a continuous latent variable, or a discrete latent variable with infinite components, inference would require marginalizing the latent variable which results in intractable integrals.

The second limitation comes from decomposability. A typical approach to modern computational graphs is heavy use of linear algebra. This is not possible with decomposable product units. Consider a linear transformation

\[\begin{split} f(x, y) = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}. \end{split}\]

The probability distributions would then be expressed over the transformed variables and require us to use a change of variables here.

\[\begin{align*} x_{new} &= x + 2y &y_{new} &= 3x + 4y\\ x &= x_{new} - 2y &y &= -\frac{3}{2}x_{new} + 2y_{new}. \end{align*}\]

When one would perform inference on these, one is still interested in the original variables $x$ and $y$. Performing an integral over the transformed variables would require the use of the Jacobian of the transformation and integrate over it, which is intractable for all interesting transformations.

This behaviour of the probability and the change of variables is the last theorem we will discuss (which I didn’t tell you yet since it is rather complicated).

Theorem 7 (Change of Variables)

Let $X = (X_1, \dots, X_d)$ have a joint density $p_x$. Let $g: \mathbb{R}^d \rightarrow \mathbb{R}^d$ be continuously differentiable and injective, with non vanishing Jacobian $J_g$. Then $Y = g(X)$ has density

\[\begin{align*} p_Y(y) = \begin{cases} p_x(g^{-1}(y) \cdot |J_{g^{-1}}(y)|) & \text{if y is in the range of g} \\ 0 & \text{otherwise} \end{cases} \end{align*}\]

[Hen20]

Normalizing Flows#

Normalizing flows are a way to overcome the limitations of probabilistic circuits at the price of tractable inference. Informally, a normalizing flow consists of an easy, usually fully factorized gaussian latent distribution, and a series of invertible and differentiable transformations that transform the latent distribution into the target distribution.

One can imagine this like a portal from the real world to a different world where things are easier to calculate. However, the portal is not free and requires the calculation of the Jacobian of the transformation.

Hence, integration is not possible in the real world, and the integral approximation Monte Carlo Estimate is used to evaluate the integral.

A common flow is the multivariate gaussian where the transformation is a linear transformation, and the latent distribution is indeed a fully factorized gaussian. This section only points to this idea, and I recommend reading further in here [PNR+21] if you are interested.

[CVVdB20] (1,2,3,4,5,6)

Y Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: a unifying framework for tractable probabilistic models. UCLA. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf, 2020.

[GdCP+24]

Gennaro Gala, Cassio de Campos, Robert Peharz, Antonio Vergari, and Erik Quaeghebeur. Probabilistic integral circuits. In International Conference on Artificial Intelligence and Statistics, 2143–2151. PMLR, 2024.

[Hen20]

Philipp Hennig. Probabilistic machine learning. lecture course, University of Tuebingen, 2020. https://uni-tuebingen.de/en/180804.

[Jaw03]

James W. Jawitz. Moments of truncated continuous univariate distributions. Advances in Water Resources, 27:269–281, 2003.

[khttpsmathstackexchangecomusers255727kccu]

kccu (https://math.stackexchange.com/users/255727/kccu). How does one formally define sampling from a probability distribution? Mathematics Stack Exchange. URL:https://math.stackexchange.com/q/2336295 (version: 2017-06-26). URL: https://math.stackexchange.com/q/2336295, arXiv:https://math.stackexchange.com/q/2336295.

[LAB24]

Anji Liu, Kareem Ahmed, and Guy Van den Broeck. Scaling tractable probabilistic circuits: a systems perspective. arXiv preprint arXiv:2406.00766, 2024.

[Mar24]

Pedro Zuidberg Dos Martires. Probabilistic neural circuits. arXiv preprint arXiv:2403.06235, 2024.

[NPSB23]

Daniel Nyga, Mareike Picklum, Tom Schierenbeck, and Michael Beetz. Joint probability trees. arXiv preprint arXiv:2302.07167, 2023.

[Oga22]

Haruhiko Ogasawara. On moments of folded and truncated multivariate normal distributions. Communications In Statistics — Theory and Methods URL: https://www.tandfonline.com/doi/full/10.1080/03610926.2020.1867742, 51(19):6834 – 6862, 2022.

[PNR+21]

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research. URL: https://arxiv.org/abs/1912.02762, 22(1):2617–2680, 2021.

[PLV+20]

Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina, Martin Trapp, Guy Van den Broeck, Kristian Kersting, and Zoubin Ghahramani. Einsum networks: fast and scalable learning of tractable probabilistic circuits. In International Conference on Machine Learning, 7563–7574. PMLR, 2020.

[PSA19]

Moshe Pollak and Michal Shauly-Aharonov. A double recursion for calculating moments of the truncated normal distribution and its connection to change detection. Methodology and Computing in Applied Probability URL: https://link.springer.com/article/10.1007/s11009-018-9622-7, 21(53):889–906, 2019.

[US20]

UCLA-StarAI. Probabilistic circuits: representations, inference, learning and theory (tutorial at ecml-pkdd 2020). YouTube https://www.youtube.com/watch?v=2RAG5-L9R70, 2020.