Learning univariate, model-free and deterministic distributions#
In this tutorial, I go through the line of thinking, math and usage of learning univariate, model-free and deterministic distribution from raw data.
In most machine learning scenarios, the distribution of the data is unknown. Learning a distribution is then usually done by assuming a functional form and fitting the parameters of the function to the data. This is called parametric learning.
However, we learn a distribution without assuming a functional form. This is called non-parametric learning.
Let \(\mathcal{D}\) be a dataset of \(N\) samples \(\mathcal{D} = \{d_1, d_2, \dots, d_N\}\), where \(x_i \in \mathbb{R}\). The dataset is constructed from a mixture of normal distributions \(p(x) = 0.5 \cdot N(0, 1) + 0.5 \cdot N(5, 0.5)\).
import numpy as np
np.random.seed(69)
dataset = np.concatenate((np.random.normal(0, 1, size=(1000, 1)), np.random.normal(5, 0.5, size=(1000, 1))))
dataset
array([[ 0.9155071 ],
[-0.60354197],
[ 1.16229517],
...,
[ 4.99355308],
[ 5.14493969],
[ 5.50211339]], shape=(2000, 1))
The idea is to create as many components in a mixture of uniform distributions as is necessary to achieve a good fit to the data. The fitness of the model is measured by the weighted log likelihood of the data under the model (compare with Definition 8). The weighted log likelihood is derived from the weighted likelihood. The weighted likelihood of the data under the model is given by
where \(\boldsymbol{\Theta}\) represents the parameters of the model.
By applying the logarithm to the weighted likelihood, we get the log likelihood
When the uniform distribution is assumed for the data, the weighted log likelihood is given by
Since the density of the uniform distribution is constant, the weighted log likelihood can be restated as
which is particularly easy to calculate.
In case of a mixture of two uniform distributions, the likelihood is given by
We will solve this problem using a recursive partitioning scheme. First off, we will sort the data, make it unique. Next, we iterate through every possible partitioning that can be made on the data, such that we get two datasets
The split values are calculated by the distance maximizing value between \(d_k\) and \(d_{k+1}\).
The likelihood of such a split is given by assuming a uniform distribution on the left and right side of the split. This constructs a deterministic mixture of uniform distributions where the log_weights are given by the relative sum of log_weights on the left and right side of the split.
The density of the mixture is given by
The determinism allows simplifying the likelihood calculation for all samples in the left split to just
and for the right split to
Furthermore, we can observe that the density is constant for every sample in the left split and constant for every sample in the right split.
Plugging it in, we get the likelihood for the split
The most likely split is selected, and the process is repeated recursively on the left and right side of the split.
If the likelihood improvement of the best split, compared to no splitting, is smaller than a given threshold, the process is terminated. The likelihood to compare against is given by the weighted log likelihood of a uniform without splitting, i.e.
The best threshold is given by selecting the maximum weighted likelihood improvement over all possible splits. The improvement value is given by
In simpler terms, the induction is terminated as soon as the likelihood does not improve by more than \(\epsilon\,\%\) anymore if a split is made.
This parameter is referenced to as min_likelihood_improvement in the code.
This algorithm is implemented in the NygaDistribution.
Theory Example#
Consider the dataset \(\mathcal{D} = \{ 1, 2, 3, 4, 7, 9 \}\) with uniform log_weights, i.e., \(w_i = 1, \forall i\).
The log likelihood of a uniform distribution without a split over \(\mathcal{D}\) is given by
We can now calculate every possible split and get the list \(splits = [1.5,\, 2.5,\, 3.5,\, 5.5,\, 8]\) Let’s calculate the log likelihood for every split.
Iteration: \(k=1, D_{split} = 1.5\)
Iteration \(k=2, D_{split} = 2.5\)
Iteration \(k=3, D_{split} = 3.5\)
Iteration \(k=4, D_{split} = 5.5\)
Iteration \(k=5, D_{split} = 8\)
We can see that the split at 3.5 has the highest likelihood. But how much better is it? The relative improvement is given by $\( \frac{exp(-12.02) }{exp(-12.48)} = 1.58, \)$ which means that the resulting likelihood is 58% higher than the likelihood without a split.
Practical Example#
Finally, if we apply the algorithm to the dataset, we get the following result. We can see that it looks very similar to the gaussian mixture we sampled from.
from probabilistic_model.learning.nyga_distribution import NygaDistribution
from random_events.variable import Continuous
import plotly
plotly.offline.init_notebook_mode()
import plotly.graph_objects as go
distribution = NygaDistribution(Continuous("x"), min_samples_per_quantile=100, min_likelihood_improvement=0.01)
distribution = distribution.fit(dataset)
fig = go.Figure(distribution.plot(), distribution.plotly_layout())
fig.show()
Comparing this to the Gaussian distribution we sampled from, we can see that the result is very similar.
from probabilistic_model.distributions import GaussianDistribution
from probabilistic_model.probabilistic_circuit.rx.probabilistic_circuit import *
mixture = ProbabilisticCircuit()
gaussian_1 = leaf(GaussianDistribution(Continuous("x"), 0, 1), mixture)
gaussian_2 = leaf(GaussianDistribution(Continuous("x"), 5, 0.5), mixture)
s1 = SumUnit(probabilistic_circuit = mixture)
s1.add_subcircuit(gaussian_1, np.log(0.5))
s1.add_subcircuit(gaussian_2, np.log(0.5))
fig.add_traces(mixture.plot())
In summary, the Nyga Distribution is a universal, univariate, smooth and deterministic distribution approximation without any assumptions on functional form.