Infinitely Wide Neural-Networks | Neural Tangents Explained

use Python and JAX to efficiently build infinitely wide networks for deeper network insights on your finite machine.

What happens when your neural networks stretch into infinity. Image: ESA/Hubble & NASA,

One notorious problem with deep learning and deep neural networks (DNNs) is, that they can become black boxes. Lets say, that we have fitted a network with good test performance on a given classification problem. However we are now stuck. We cannot make sense of the final weights that have been learned or adequately visualize the problem space!
Another issue arises in the real world. In practical applications of applying neural networks we often fall back to train ensembles of networks. We use the averaged output of many models. This can be more powerful than the output of one single network. When we have multiple models we can also better analyze overall performance over the given problem.
Neural Tangents aims to solve these challenges elegantly.
In this post we will explore the background and application of neural networks of infinite width. We take a look at Neural Tangents as a framework and high level API, which allows us to build such networks efficiently. We will use Neural tangents on a simple regression problem and train both a finite and infinite network using JAX. From this we can gain more insights from our model and can provide better analytics. We can use both gradient descent and performing inference, when building networks with Neural Tangents and we will exploit both features when looking at our model.
Firstly, how do we get to infinity and back?

A short primer on Neural Networks

Generally speaking, we build a neural network from assumptions about the underlying problem-space ; for example proximity and values of pixels, to detect edges and coherent elements in pictures. Usually a network’s topology fits to solve a class of problems.
We then initialize the network weights. This can be done from a Gaussian distribution. We train the model by adapting the model-parameters in order to minimize a loss function for the problem that we have previously defined. After the training is done we can test the model, with a selection of metrics on an unseen test-set of the data.
Now we want to bring some light into the dark: We would like easier to explain analytical properties. One essential question is, why (deep) neural networks generalize so well even though they tend to be overparameterized [4].

What happens when we stretch Deep Neural Networks?

It turns out when we increase the width of a regular DNN we can gain some nicer insights and overall better understanding. Very, very, (very) wide Networks show convergence to Gaussian Processes [9]. It had been established for some time, that this is true for single layer neural networks (Ch. 4 in [8]). It has recently been shown that it can also be applicable for deeper networks [9].
What do we get from transforming a Neural Network into a Gaussian Process?
This allows for better analysis and uncertainties over our estimates. In addition to the previous NN metrics we can now perform Bayesian inference, assess uncertainties over prior and posterior, and analyze them.
One essential cornerstone is, that a Gaussian Process is associated with a kernel for its covariance function. These kernels allow for better analytical properties. The mathematics behind kernel functions is in parts better understood and more developed than when working with regular neural networks.
Figure 1 gives a graphical overview of what we are doing. We start with a regular DNN of finite width and take the width of the hidden layer to its limits.

Fig 1) We take a regular deep neural network (A) with one hidden layer as a representation, with an input and output layer. We increase the width of the hidden layer to infinity (B) observing the limits of the neural network. The limiting behavior of the network tends to a Gaussian Process (C). A graphical model serves as a GP representation in the style of Fig.2.3 by Rasmussen [8]. Two kernels (that will be introduced later) NNGP and NTK are proposed to describe the infinite behavior of the network.

Two essential kernels – our gates to infinity

What connects the neural network with the fabled Gaussian Processes?
One essential assumption is, that at initialization (given infinite width) a neural network is equivalent to a Gaussian Process [4]. The evolution that occurs when training the network can then be described by a kernel as has been shown by researchers at the Ecole Polytechnique Federale de Lausanne [4] .

Context on kernels

For the sake of a simple introduction: we say that a kernel is a function that relates two points from your data-set (\mathcal{X}\times \mathcal{X}) to each other. For this fact to work, you need a measurable space that has some underlying properties (looking at you RKHS).
How your pairs of points relate to each other is dependent on a kernel function.
We can use a kernel on our dataset to introduce relations between the data-points, measure similarities, and other properties.
This unnecessarily short paragraph does not do the field and exciting topic of kernel methods justice and I recommend checking out more insightful resources on the subject [5][6].
Now we already have a network, with its architecture, activations and also pre-activations (like convolutions, pooling – you name it) we denote that as z.

The NNGP Kernel is simply defined as:

K_z=\mathbb{E}_\theta [z_i, z_i^T]

and the NTK Kernel:

\Theta_z = \mathbb{E}_\theta [\frac{\delta z_i}{\delta\theta}(\frac{\delta z_i}{\delta \theta})^T]

In layman’s terms, the last kernel describes the change in the network parameters up until z.

The relationship between kernels and tensors

Roughly speaking, there is a direct correspondence between tensor operations and kernel operations. We can therefore express relationships that come from a neural network in terms of kernel methods.
That means, for each (tensor) operation in the neural network we can find an equivalent kernel operation.

There exists a pointwise translational rule, given our two kernels, such that:

(K_y, \Theta_y) = \phi^*(K_z, \Theta_z)\equiv(T(K_z), \bar{T}(K_z))\odot\Theta

and from this we can transform dense layers through our both kernels as:

(K_h, \Theta_h)=\text{Dense}(\sigma_\omega, \sigma_b)*

For a rigorous explanation see Eq.(2)-(7) in [3].

A simple convolutional example

For an convolutional neural network that could translate into:

  1. Layer 0 – input: y^0=X \Leftrightarrow K^0=XX^T and NTK: \Theta^0=0
    This means that the NNGP Kernel operation is simply the given dataset in a NxN matrix and the NTK kernel is zero, since nothing has happened yet in your network.
  2. Layer 0 – convolution: z^0 = Conv(y^0) \Leftrightarrow \bar{K}=\mathcal{A}(K^0) and NTK: \Theta^0=K^0+\mathcal{A}(\Theta^0)
    Here we see the convolution operation by applying the NNGP kernel over the convolutions with the \mathcal{A} operation (see below) and the NTK kernel operates on the computed NNGP Kernel taking previous \Theta^0 into account.
  3. Layer 1 – activations: y^1 = \phi(z^0) \Leftrightarrow K^1=\mathcal{T}(K^0) and NTK: \Theta^1=\bar{T}(K_z))\odot\Theta
    At this stage activation is performed, which is denoted as \mathcal{T} for NNGP. From the NTK perspective the stage of the network is the dot-product of the activation with the state of the \Theta thusfar.
  4. Layer 1 – …

And so on, all the way through the network.

Above \mathcal{A} is a dedicated operation which describes the summation over the convolutional filter. (You can find a complete exemplary overview in Fig. 4 and Table 1 in [3].)
And T is defined as

Using Neural Tangents – hands-on

Now that we are familiar with some of the underlying concepts, we are going to use JAX and the Neural Tangents library. We build an exemplary network and evaluate it in detail.

Enter some regression problem

In order to get going, lets create an arbitrary regression problem from scratch, using sklearn. We create a dataset with two features x_1, x_2 and the function value f, like this:

n_samples = 125
regression = datasets.make_regression(n_samples=n_samples, n_features=2, n_informative=15, noise=0.25)
data = regression[0]
f = regression[1]

Sidenote: Above is only a small number of samples. You are encouraged to try out a bigger problem. Using Google Colab I ran into memory issues when using larger samples (n >750).

From this we get data, that should look something like this:

Fig. 2) Randomly generated regression dataset.

We split our data-set in training and testing for later processing. Be aware that we need an extra axis for our y-values for later computation and that we take a test set of 10% of our total data-set size.

X_train, X_test, y_train, y_test = train_test_split(data, f[:, np.newaxis], test_size=0.1, shuffle=True)
train = (X_train, y_train)
test = (X_test, y_test)

Now we create a simple hidden layer neural network, using JAX. Very simple in fact, since the hidden layer barely holds any neurons (see stax.Dense(5)) together with a relu activation function.

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(5), stax.Relu(),

Neural tangents returns three objects from this. The kernel_fn part will serve for our analysis over infinity. It is the kernel representation of the architecture that we encountered previously. The init_fn and apply_fn correspond to networks of finite width.

We can visualize before even training a network?

The first interesting feature that neural tangents allows are insights from the kernels.
We compute the kernels for our test-data as such:

kernel = kernel_fn(X_test, X_test, ('nngp', 'ntk')) # get both NNGP and NTK kernel from the test-data
Fig. 3) Prior kernels over our test sample.

This gives us two things.:
One is the performance of Bayesian inference represented through the NNGP kernel at infinite time-steps. We achieve this by looking at our NNGP kernel.
Secondly, the NTK kernel corresponds to how you network would behave after having been trained for an infinite amount of time.
In theory we can sample from the prior distribution. However for this specific problem it is not very informative, so we’ll skip this step. You can find a nice introduction to how prior samples look in Google’s Colab Cookbook.

Actual Inference

After this first insight, we perform inference:

predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, X_train, 

ntk_mean, ntk_covariance = predict_fn(x_test=X_test, get='ntk', 

ntk_mean = np.reshape(ntk_mean, (-1,))
ntk_std = np.sqrt(np.diag(ntk_covariance))

Where we solve for infinite runtime training over the network, given the kernel that we have constructed. We do this by utilizing the NTK kernel in predict_fn(x_test=X_test, get='ntk', ...
This is called: solving in closed form for gradient descent.
From this we can compute a mean and a covariance matrix. The variables above are ntk_mean and ntk_std . Note that for this specific problem the standard deviation is very small. Though we can still visualize how the means perform:

Fig 4.) Values generated from inference using the NTK kernel (y-axis) against the original data from the test-set (X-axis).

Inference can be done the Bayesian way as well. We just have to substitute ntk with nngp. For the performance of NNGP see the attachment, later in this post.

Computing loss

What we can do with Neural Tangents is compute the loss of the network over timesteps for both testing and training.
The loss is defined as the mean squared error or in Python: 1/n * np.mean(ys**2 - 2*mean*ys + var + mean**2, axis=1)
also we provide the time-steps of interest from 0 to 1000 in steps of 0.1 ( np.arange(0, 10 ** 3, 10 ** -1)):

ts = np.arange(0, 10 ** 3, 10 ** -1)
ntk_train_loss_mean = loss_fn(predict_fn, y_train, ts)
ntk_test_loss_mean = loss_fn(predict_fn, y_test, ts, X_test)
Fig. 5) Loss over time given test and training of our infinite network in log-scale.

The Actual Training

To have a finite comparison point we have to perform actual training of our network. As hyperparameters we set the learning rate to 0.1 and run training for 10000 steps. We use JAX’s stochastic gradient descent for our optimization.:

opt_init, opt_update, get_params = optimizers.sgd(learning_rate)
opt_update = jit(opt_update)

We also need to compute loss and gradient loss in the process of training. As a loss function we use the mean-squared error. Note that we also use the JAX specific jit-compilation feature to make loss and grad-loss more performant in the process.

loss = jit(lambda params, x, y: 1/(len(x))*np.mean((apply_fn(params, x) - y) ** 2))
grad_loss = jit(lambda state, x, y: grad(loss)(get_params(state), x, y))

Now we run the training:

opt_state = opt_init(params)

for i in range(training_steps):
  opt_state = opt_update(i, grad_loss(opt_state, *train), opt_state)

  train_losses += [loss(get_params(opt_state), *train)]  
  test_losses += [loss(get_params(opt_state), *test)]

Our Outcome

Fig 6.) Predictions over test data with underlying truth of the test-data. Good correlation of test data against predictions w.r.t. feature 1, worse performance for feature 2.

Correlation (Pearson) Training feature x_1 r: 0.96
Correlation Training feature x_2 r: -0.03


Correlation (Pearson) Testing feature x_1 r: 0.97
Correlation Testing feature x_2 r: 0.58

Fig. 7) Comparison finite against infinite with computed loss over timesteps in log-scale.

From the last figure, we can see that the finite network performs better after 10 steps. Though please keep in mind that this is a very simple exercise. For more complex tasks [3] shows that Neural Tangents performs at least as well as finite networks.

What makes Neural Tangents performant?

Working with covariances and kernels in a large problem domain can be very expensive. Some of the key challenges are: (1) Computing covariances when (a) inverting matrices, (b) constructing them, (c) updates to them and (2) computing kernels (a) for multiple kernels (b) across all the data.

Neural Tangents offers solutions to make these problems more feasible:

  1. leveraging the internal structures to reduce covariance inversion by orders of magnitude (for classification problems),
  2. tracking only the covariance matrices necessary and optimizing for the properties of the convolutions allowing for a reduced runtime and minimal memory footprint,
  3. treating covariance computations as 2D convolutions to allow for hardware accelerators,
  4. computing the NNGP and NT kernel together, since NTK requires NNGP,
  5. allowing for batching in training such that the dataset does not have to be treated all at once.

This all combined makes computational intense tasks more feasible on standard hardware.


We have seen what benefits can come from using infinitely wide neural nets:

  1. We gained additional insights from the kernel, right after specifying our network architecture,
  2. We computed the infinite width network in closed form,
  3. We gain a mean function and standard deviations from our dataset by doing inference.

Taking NNs to the their limits, ties them well together with Gaussian Processes; a classical machine learning approach. This allows for performing inference, observe how the data-set correlates through the given kernel(s). We have also seen, how Neural Tangents makes this accessible and can give insights and improvements to conventional network structures when building artificial neural networks.
Neural Tangents is an exciting module to allow better insights into deep neural network structures and for more analytical work on the outputs.
Neural Tangents is a high level API that takes a way a lot of the heavy lifting and problems that arise when one takes his network to the limits – with Python and JAX it is even more accessible.

The makers of Neural Tangents have already announced that they are looking into more network structures for future work to come.

For the math and computer science enthusiast I highly recommend checking out the reference publications.

The Code to produce the plots


  1. Fast and Easy Infinitely Wide Networks with Neural Tangents Friday, March 13, 2020 by Samuel S. Schoenholz on the Google AI Blog
  2. Neural Tangents Cookbook in Colab
  4. Neural Tangent Kernel: Convergence and Generalization in Neural Networks by A.Jacot, F.Gabriel, C. Hongler
  5. Kernel Methods in Machine Learning
  6. Kernel Cookbook Advice on Covariance Functions by David Duvenaud
  7. Wide Residual Networks by Sergey ZagoruykoNikos Komodakis
  8. Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams 2006.
  9. Deep Neural Networks as Gaussian Processes by J. Lee, Y. Bahri, et al. 2018.

Optimizing Hyperparameters the right Way

Efficiently exploring the parameter-search through Bayesian Optimization with skopt in Python. TL;DR: my hyperparameters are always better than yours.

Explore vast canyons of the problem space efficiently – Photo by Fineas Anton on Unsplash

In this post we will build a machine learning pipeline using multiple optimizers and use the power of Bayesian Optimization to arrive at the most optimal configuration for all our parameters. All we need is the sklearn Pipeline and Skopt.
You can use your favorite ML models, as long as they have a sklearn wrapper (looking at you XGBoost or NGBoost).

About Hyperparameters

The critical point for finding the best models that can solve a problem are not just the models. What we need is to find the optimal parameters to make your model work optimally, given the dataset. This is called finding or searching hyperparameters.

For example, we would like to implement a Random Forest in practice and its documentation states:

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, ...

Then every of those parameters can be explored. This can include all possible number of estimators ( n_estimators ) in the forest from 1 to 10.000, if you would like to split using {“gini”, “entropy”}, or the maximum depth of your trees and many, many more options. Each of those parameters can influence your model‘s performance and worst of all most of the time you do not know the right configuration when you’re starting out with a new problem-set.

The Jack-Hammer aka Grid-Search

The brute-force way to find the optimal configuration is to perform a grid-search for example using sklearn’s GridSearchCV. This means that you try out all possible combinations of parameters on your model.
On the bright side, you might find the desired values. The problem is that the runtime is horrible and over all grid-searching does not scale well. For every new parameter you want to try out, you will test all of the other previously specified parameters also.
After all, this is very uninformative and we have the intuition that some picks of parameters are more informative than others – right?
We don’t need to test for all parameters – especially not those of which we know that they are far off.
One step in the right direction is randomized search like RandomizedSearchCV, where we pick the parameters randomly while moving in the right direction.

Better Bayesian Search

Our tool of choice is BayesSearchCV. This approach uses stepwise Bayesian Optimization to explore the most promising hyperparameters in the problem-space.
Very briefly, Bayesian Optimization finds the minimum to an objective function in large problem-spaces and is very applicable to continuous values. To do this it uses Gaussian Process regression on the objective function under the hood. A thorough mathematical introduction can be found in [2].
The Bayesian Optimization approach gives the benefit that we can give a much larger range of possible values, since over time we automatically explore the most promising regions and discard the not so promising ones.
Plain grid-search would need ages to stupidly explore all possible values.
Since we move much more effectively, we can allow for a much larger playing field.
Let’s look at an example.

The Problemset

Today we use the diabetes dataset from sklearn for ease of use.
This saves us the trouble of loading and cleaning our data and the features are already neatly encoded.

Fig. 1 – Overview over the dataset.

We have a set of encoded columns from age, sex, bmi, blood-pressure and serum values, numerically encoded. Our target value is a measure of the disease progression.
We can see some interactions on the s-columns, indicating correlation for the blood-values.

Fig. 2 – Excerpt from a pairplot examining parameter interactions over all data.

To build our pipeline we first split our dataset into training and testing respectively in a 80:20 split.

X_train, X_test, y_train, y_test = train_test_split(diabetes_df.drop(columns="target"),, test_size=0.2, random_state=21)

Build a Pipeline

We need three elements to build a pipeline: (1) the models to be optimized, (2) the sklearn Pipeline object, and (3) the skopt optimization procedure.

First, we choose two boosting models: AdaBoost and GradientBoosted regressors and for each we define a search space over crucial hyperparameters. Any other regressor from the depth of the sklearn library would do, but boosting might win you the next hackathon (…its 2015 isn’t it?)
The search-space is a dictionary with the key-value pair := { ‘model__parameter‘ :}. For each parameter we set a space from the skopt library in the range that we want. Categorical values are also included by passing them as a list of strings (see Categorical below):

ada_search = {
    'model': [AdaBoostRegressor()],
    'model__learning_rate': Real(0.005, 0.9, prior="log-uniform"),
    'model__n_estimators': Integer(1, 1000),
    'model__loss': Categorical(['linear', 'square', 'exponential'])

gb_search = {
    'model': [GradientBoostingRegressor()],
    'model__learning_rate': Real(0.005, 0.9, prior="log-uniform"),
    'model__n_estimators': Integer(1, 1000),
    'model__loss': Categorical(['ls', 'lad', 'quantile'])

Second, we select over which regression model to pick through another model, this is our pipeline element, where both optimizers (adaboost and gradientboost) come together for selection:

pipe = Pipeline([
                 ('model', GradientBoostingRegressor())

Third, we optimize over our searchspace. For this we invoke the BayesSearchCV. We also specify how the optimizer should call our search-space. In our case that is 100 invokations. Then we fit the pipeline with a simple skopt .fit() command:

opt = BayesSearchCV(
    [(ada_search, 100), (gb_search, 100)],
), y_train)

After the fitting is done, we can then ask for the optimal found parameters. This includes using the score function on the unseen test-data.

Fig. 3 – Evaluating the fitted pipeline

We can see that the validation and test-score. The model with the best results is the AdaBoost model with a linear loss and 259 estimators at a learnin-rate of 0.064. Neat.

From the output we can see that the optimizer uses a GP for optimization under the hood

Fig. 4 – The insights of the optimizer results with a peak into the model.

For illustrative purposes we can also call GP minimization specifically – we’ll use the AdaBoost regressor for that – only focusing on a numerical hyperparameter space.

ada = AdaBoostRegressor(random_state=21)

# numerical space
space  = [Real(0.005, 0.9, "log-uniform", name='learning_rate'),
          Integer(1, 1000, name="n_estimators")]

The major change here is that we have to define an objective function over which we have to optimize. In order to see how the model performs over the parameter-space we use the named arg decorator.

def objective(**params):
    return -np.mean(cross_val_score(ada, X_train, y_train, cv=5, n_jobs=-1,

Therefore our optimization is the negative mean score that we get from the cross-validated fit while using the negative mean absolute error . Other scoring functions might suit you better.
We then use GP minimzation to fit the most optimal parameters for our regressor.

gp_minimize(objective, space, n_calls=100, random_state=21)

Visualize the problem space – post-optimization

Using GP optimization directly allows us to plot convergence over the minimization process.

Fig. 5 – Convergence of GP minimization while finding the optimal hyperparameters of the AdaBoost regressor with respect to the target column in the dataset.

We can see that the min in the function value has already converged after around 40 iterations .
The last excellent feature is visualizing the explored problem space. Since we used only numerical input at this point, we can then evaluate the problem space by plotting it using skopt, like so:

Fig. 6 – Visualizing the invoked learning_rate and n_estimator values over the problem space as histograms. The minimum is marked by the red star in the scatterplot.

From the figure above we can see that a lot of exploration was done on the lower end of our learning-rate spectrum, and the peaks of the tried and tested estimator number was 200 and over 800.
The scatterplot paints a picture of how difficult the problem space is that we are trying to optimize – with the minimum marked with a red star. Though we have reached convergence not a lot of values are evident in the region around the minimum and increasing the number of evaluations can be considered.

Now that we have the best model given our parameter space we can implement that model accordingly and go into its specific analysis.
As you have seen skopt is a library that offers a lot in the realm of optimizations and I encourage you to use them in your daily ML engineering.

Happy Exploring!

The complete code can be found in a Notebook here.


  1. Marc ClaesenBart De Moor. Hyperparameter Search in Machine Learning. 2015 on arXiv.
  2. Peter I. Frazier. A Tutorial on Bayesian Optimization. 2018 on arXiv.
  3. scikit-optimize contributors (BSD License). Scikit-learn hyperparameter search wrapper.

Millennial Suicides – an Analysis about Change

A simple, yet meaningful Pyro model to illustrate change-points over time.

Walking alone. A photo by Mitchel Lensink on Unsplash.

One profound claim and observations by the media is, that the rate of suicides for younger people in the UK have risen from the 1980s to the 2000s. You might find it generally on the news , in publications or it is just an accepted truth by the population. But how can you make this measurable?
In order to make this claim testable we look for data and find an overview of the suicide rates, specifically England and Wales, at the Office for National Statistics (UK) together with an overall visualization.

Making an assumption tangible

Generally, one type of essential questions to ask – in everyday life, business or academia – is when changes have occurred. You are under the assumption that something has fundamentally changed over time. In order to prove that you have to quantify it. So you get some data on the subject-matter and build a model that shows you when changes have occurred as well as their magnitude. We are going to look at exactly how to do that.

Millennial Suicides Specifically

Now that we have our data, we take the average for the age-group of millennials. That entails ages in the range of 25 to 35, for the sake of simplicity. Given our dataset, we can visualize the average suicide count for millennials since the 1980s like this:

suicide_millenials_df = suicide_data[suicide_data.Age.isin(range(23,35))].groupby(["Year"]).mean().reset_index()
# round values to closest integer
suicide_millenials_df["Suicides"] = suicide_millenials_df.Suicides.round()
Yearavrg. Suicide per 100.000
What the head of your dataframe should look like.
Fig. 1 – The average suicides (per 100.000 people) in the age group 25 to 35 for England and Wales from 1981 to 2017.

From this data we can clearly see a steady increase over time until the very early 2000s followed by a decline until the 2010s. The inquisitive side of us wants to know when exactly that happened so that we can go out and look for the root causes and the underlying issues associated with those trends. Lets build a model to find out exactly in which years to look for that substantial change. To look for a specific change-point we will use the Pyro PPL as a probabilistic framework.

The How

We will use the mathematical underpinnings that are described in [2]. This entails:
a uniform prior distribution over the years T , the rate of change \mu from a half-normal distribution (the positive side of the Gaussian) in inversion-bracket notation. Looking at the data we set a large scale for our rate as we want to capture the change from 75 to 130 – something around 50. Lastly we find the observed rate of suicides through a Poisson regression.
Note: The Poisson regression entails, that we will sample from a normal distribution.

T \sim \mathcal{U}(1981,2017), \mu_0\sim\mathcal{N}^+(0, 50), \mu_1\sim\mathcal{N}^+(0, 50), n_t\sim\text{Poisson}(\mu_{[t>T]})

These parameters can be adjusted – maybe you look for different scales of changes, either across time or effect-size. One might consider a half-Cauchy distribution for more extreme values, maybe an increase or decrease on the scale.

The Pyro Model

def model(years, suicides):
    σ = pyro.param('σ', torch.ones(data_len))
    T = pyro.sample('change', dist.Uniform(1981, 2017))
    grp = (years > T) * 1
    # Independent events 
    with pyro.plate('rate', 2):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50))
    y_obs = pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)

The grp gives us the index for our bracketed \mu .
The model observations are then sampled from the Normal distribution taking into account \mu and a scale=\sigma=1 (last line in the model).

Before we can run this model we convert our dataframe to Pyro readable tensors using PyTorch, like so:

years = torch.from_numpy(suicide_millenials_df.Year.values)
suicides = torch.from_numpy(suicide_millenials_df.Suicides.values)
data_len = len(suicide_millenials_df)

For the fitting procedure we perform Bayesian Inference utilizing MCMC methods. We sample for 1000 iterations and let the sampler burn-in up for 300 iterations.

SAMPLES = 1000
WARMUP = 300
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=SAMPLES, warmup_steps=WARMUP), suicides)

We specifically use the hands-off NUTS sampler to perform inference and find values for our Pyro parameters and recommend that you do the same.
Beware that in contrast to Variational inference approaches the MCMC takes its time. After training we find a relatively high acceptance probability of around 91.2% and a stepsize \epsilon of around 0.00198.
Checking the model summary we see that everything is in order- no divergences – maybe our R-hat values are even too high.

            mean       std    median      5.0%     95.0%     n_eff     r_hat
    change   1995.40      2.10   1995.40   1991.72   1998.24      3.04      1.94
      μ[0]     96.21     10.41     97.56     93.81    101.23     20.16      1.08
      μ[1]     87.62      2.59     87.90     83.01     90.94      3.03      1.95

Now with the samples at hand, we can display when a change-point has occurred most frequently over time:

Fig. 2 – Histogram of the occurrence of the change-point across years.

We go ahead and plot our different rates over the given time

def pl(pt):
    grp = (years > pt['change']) * 1
    line = plt.plot(years, pt['rate0'] * (1 - grp) + pt['rate1'] * grp, 
      color='red', alpha=0.005)    

df.apply(pl, axis=1)
Fig. 3 – Change-rate over time (red-line) for 1000 samples. There is a change around 1995 to 1996 to mark the downwards trend.
Fig. 4 – Another rate of change over time as the red line. We see over three events between 1986 to 1989 the change has manifested itself. This run took 2000 samples.

First we ran the previously described model with a 1000 samples. This uncovered the change-point around the year 1995 to indicate the upcoming downwards trend. The output can be seen in Fig.3 where the rate is indicated as a red-line. Here model has quantified the change-point for the better development.
We took another shot at running the model for which we took 2000 samples and the model converged on a time-range in the years of ’86 to ’89 for yet another change-point. This marks the potential turning-point for an upwards trend. If we want to uncover the cause behind an increase in suicides among Millennials we have to look for reasons in that time-range or earlier changes, of which the effects have manifested themselves into this time.

Open Questions

We have shown that there is not just one change happening over time. There is a range of points for an increase in the rate and a range of points for a decreasing rate. One might extend the model to pick up more than one change-point, making the \mu three dimensional or fitting an entirely different model.

Concluding the Model

We have shown how to get from data, to a concrete formulation of a change in time and their underlying probabilities. This was directly translated in Pyro and through sampling we could infer concrete change-points. Probabilistic Programming is a powerful engine.

The complete notebook can be found here:

The Gravity on the Subject

Though this was intended as an introductory piece on the topic of Data Science, the topic of self-harm and death is a grave one. In case you or a loved one are affected there is help and resources out there. So I will leave you with this message here:


  1. Office for National Statistics, UK. Middle-aged generation most likely to die by suicide and drug poisoning. Dataset on suicide-rates. 2019.
  2. C. Scherrer. Bayesian Changepoint Detection with PyMC3. 2018.
  3. M. Hoffman, A. Gelman. The No-U-Turn Sampler. JMLR v15. 2014.

Improve Your Model with Missing Data | Imputation with NumPyro

Make the best of missing data the Bayesian way. Improve model performance and make comparative benchmarks using Monte Carlo methods.

A Missing frame ready to throw you off your model. Photo by Vilmos Heim on Unsplash.

The Ugly Data

How do you handle missing data, gaps in your data-frames or noisy parameters?
You have spent hours at work, in the lab or in the wild to generate or curate a dataset given an interesting research question or hypothesis. Terribly enough, you find that some of the measurements for a parameter are missing!
Another case that might throw you off is unexpected noise that was introduced at some point in the experiment and has doomed some of your measurements to be extreme outliers.
Sometimes you cannot just exile a whole parameter from your analysis, simply because it has 25% missing entries or maybe entails a “does not apply” field in your questionnaire.

Fear not, there is a way you can make the most of your dataset anyways.
In this post we will use Probabilisitc Programming [5] to fill the gaps that your model might face.

Simulate Missing Data

Lets build a problem from scratch. The studious reader is most likely familiar with the Iris flower data-set. It is somewhat part of the “hello-world” in the ML community; next to the well-known MNIST that is.
We use sklearn.datasets to get the data and put it into a pandas dataframe like so.:

iris_data = datasets.load_iris() 
iris_df = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']],
                      columns=iris_data['feature_names'] + ['target'])

To make this example more explicit lets select for only 2 out of the given 3 species.:

iris_df["target"] ="int")
iris_df = iris_df[[0, 1])]

Now you should see 100 measurements (rows that are our n) of the Iris flower data.
We now take out 50% of the petal length entries by chance and assign NaN values with a bit of numpy and pandas magic.

random_vec = np_random(iris_df["petal length (cm)"].shape)<0.5 # 50% marked by numpy random vector on the column
iris_df["petal length (cm)"] = iris_df["petal length (cm)"].where(random_vec, other=np.nan)

Why do we care about all of our measurements?

When we plot our data, we can clearly see that the petal length and petal width can help us identify which species (target-column) we have (see Fig. 1).

Fig. 1: The relation of petal length and width toward the flower species aka. target-column. We can clearly identify two clusters.

We can make an even clearer distinction by regressing on the groups using just the vector which we have decimated earlier.

Fig. 2: The regression of the petal-length parameter on the identified species (target)

Too bad about half of the data is missing. So what do we do now?

Different Shades of Missing

First off, there are different degrees of missing data:

Missing Completely at Random (MCAR) and Missing at Random (MAR).

For a formal introduction to this see Bayesian Data Analysis [1] Ch. 18

One important distinction is that MCAR treats your missing parameters as independent of your observed measurements (lets call them y).
In our example we rely on our parameter being missing at random or MAR. That means we don’t think that the chance of a value gone missing is dependent on the parameter property BUT we think that there is potentially some relation between our length measure and the target species.

A Regression Model to Rule them all

First lets extract some basic info from the data of what we have so far:

>>> print("Petal Length μ = {}".format(iris_df["petal length (cm)"].mean()))
>>> Petal Length μ = 2.8072727272727267
>>> print("Petal Length σ^2 = {}".format(iris_df["petal length (cm)"].std()))
>>> Petal Length σ^2 = 1.4923079993698234

Now we build the model to do the imputation. For this we use the tool NumPyro, which is light-weight probabilistic numpy backend to Pyro – a heavy duty probabilistic language.
This comes in handy after formulating the model. We will use MCMC methods to fit the model and simulate values for missing entries.
To make this more applicable to a use case we combine this with a Bayesian regression-model. That means we want to model the distinction in species given all parameters that we know. Our parameters therefore are:

sepal length := s_len, sepal width := s_width, petal width := p_width and, lastly petal length := p_len, which includes our missing data. 

For each of those parameters we introduce a learnable bias which comes from a regular Normal distribution aka Gaussian for the sake of simplicity. So lets get started.:

def model(s_len, s_width, p_len, p_width, target=None):
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))

One can do a bit of tweaking for the bias parameters above, but for our initial runs the normal Gaussian worked reasonably well.
Now the delicate part. We introduce the distribution which we assume lies under our half-destroyed parameter. As with a lot of things we assume this to be Gaussian, which in our case is a reasonable assumption. Therefore we take the get loc=\mu (2.8 +/- 0.2) and scale=\sigma^2 (1.5+/-0.2) which we had computed earlier:

    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))

The informed reader clearly sees how the \mu and \sigma themselves are probabilistic parameters for themselves which we model in the process. For both we assume Gaussians but more rigorous distributions could be chosen.
Now we have to find our targets which we model. Hence we localize our missing positions in the vector with numpy.

    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]
    if target is not None:
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
        len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], 

Additionally, we tell NumPyro to make the imputation-value (len_impute) a learnable parameter and fill with as many zeroes as values we are missing.
On a short sidenote: our model has a predictive mode, that occurs when no target (aka species) is provided. When we invoke the predictive capabilities, we sample from the introduced distribution with the sampled \mu and sampled \sigma instead. Make sure to only apply this on the NaN positions in your vector! That is the meaning of the else clause:

numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], len_sigma[len_nan_idx])

We then put the imputed values into the right places and give it the inference sampling, with the observed length values to do its optimization. For this we use the JAX library with some efficient index updating – linear-algebra style.:

   p_len = ops.index_update(p_len, len_nan_idx, len_impute)
   numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)

In our length-sampling we include the actual length measures as the observations in the process .

Putting Things Together | Imputed Values into the Linear Model

For this model to be a proper linear model, we have to fit our target y (the species).
Since we have two classes, we do this by invoking a Bernoulli distribution.
We therefore ask in a Bernoulli process whether we are looking at y=0 or y=1 – two classes, two potential outcomes.

The Bayesian Regression – Code

We introduce two parameters, one of which is a bias term for the length parameter. We compose the

    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len
    logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width

Now to the magical part. We use the logit for the Bernoulli, which composes of all the parameters that we put into the model in interaction with their biases. The observed value in this process is now the target-parameter, our y.

   numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

The complete NumPyro imputation model

def model(s_len, s_width, p_len, p_width, target=None):
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))
    # impute length
    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]
    if target is not None:
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
        len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], 
    p_len = ops.index_update(p_len, len_nan_idx, len_impute)
    numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)
    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len
    logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
    if target is None:
        # prediction case
        probs = expit(logits)
        numpyro.sample("probs", dist.Delta(probs))

    numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

Run It

To fit the model we use Bayesian Inference utilizing a MCMC method; the NUTS sampler to be specific. The beauty of the NUTS sampler is that it is a hands-off approach, as it adjusts step sizes by itself, given its stepping through the problem-space. We let it have 5000 warm-up or burn-in runs, which are discarded in the process and let it run for 10000 iterations, just to be sure. Our inference finishes with an acceptance probability of 0.96 or 96% – neat!. We then have a look at the summary what has happened:

mcmc = MCMC(NUTS(model=model), 5000, 10000, num_chains=1), **iris_data)

Fig. 3: Model Summary after MCMC indicates convergence and solid R-hat values as metric.

The output clearly shows convergence of the parameters which we sampled in our inference. Specifically the R-hat values are all >= 1.0 .

We then pull the samples and visualize the posterior – we rely on JAX again for the formatting magic:

samples = mcmc.get_samples()
posterior_mu = jnp.expand_dims(samples['l'], -1) + \
               jnp.expand_dims(samples['b_length'], -1)*iris_df["petal length (cm)"].values + \
               jnp.expand_dims(samples['b_s_len'], -1)*iris_df["sepal length (cm)"].values + \
               jnp.expand_dims(samples['b_s_width'], -1)*iris_df["sepal width (cm)"].values + \
               jnp.expand_dims(samples['b_p_width'], -1)*iris_df["petal width (cm)"].values
Fig. 4: Final Bayesian regression over MCMC samples with 90% CI.

There we have it. A clear distinction between our two classes. To be fair there is quite a bit of jitter going on above and below the bounds, but the regression-line could not be nicer.

We have shown that we can fill in the blanks in our model through MCMC-sampling.

For the code-savvy curious reader the final code can be found on my github and in Colab.

What if there are more than two classes, What about Categorical Data?

Great questions, which can be answered with NumPyro’s toolkit as well:

  1. Instead of having a Bernoulli trial in the model you can also use a Multinomial for your Bayesian inference.
  2. We can also use categorical data for model fitting. That looks quite different. Instead of using the one-dimensional array we rely on as many dimensions as there are categories.

A good example to fit with categorical data instead of continuous one is in the excellent kaggle notebook on Age Imputation given other parameters for the Titanic dataset:

The Dos and Don’ts

Please stick to what you have!

Do not invent data. I cannot stress that enough. It is a very fine line that we are walking here.
If your experiments have generated insufficient data or you had to throw some data out, then imputing values, might not be the best way to go. Even when your data might become presentable that way. When talking to a client or an academic journal you don’t get points for: we simulated parameter X and Y to 85.3% – no matter how nice the model looks.
Please do declare what methods, distributions and, assumptions you used to make the models and imputations for maximum transparency.
One recommendation is that you can always take out the data, fit a model without missing measurements and compare it to a model with imputed values in the end. That can make for a compelling story.


We have walked through data imputation, from the data-frame with holes in it to the final Bayesian inference and fitted posterior distribution.
We have shown how your Bayesian regression can benefit from an imputed vector and what the output looks like including confidence intervals.
Probabilistic Programming can be a powerful tool in your daily work.

Best of luck with your models and analysis!


[1] A. Gelman, J.B. Carlin, et. al., Bayesian Data Analysis . Third Edition.
[2] R. McElreath. Statistical Rethinking. 2016. CRC Press.
[3] Uber Technologies. Regression using NumPyro. 2019. NumPyro Documentations.
[4] M. Betancourt. A Conceptual Introduction to Hamiltonian Monte Carlo. 2020 on ArXiv.
[5] CS 4110 – Programming Languages and Logics. Probabilistic Programming. 2016. Cornell University.

The Bayesian Toolkit | 3 Probabilistic Frameworks

Account for uncertainties in your programs and pipelines or build probabilistic models.

The tools to build, train and tune your probabilistic models. Photo by Patryk Grądys on Unsplash.

We should always aim to create better Data Science workflows.
But in order to find out how to achieve that we should find out what is lacking.

Classical ML workflows are missing something

Classical Machine Learning is pipelines work great. The usual workflow looks like this:

  1. Have a use-case or research question
  2. build and curate a dataset that relates to the use-case or research question
  3. build a model
  4. train and validate the model
    1. maybe even cross-validate, while grid-searching hyperparameters.
  5. test the fitted model
  6. deploy model for the use-case or answer the research question

As you might have noticed, one severe shortcoming is to account for certainties of the model and confidence over the output.

Certain about being Uncertain

After going through this workflow and given that the model results looks sensible, we take the output for granted. So what is missing?
In this basic approach we have not accounted for missing or shifted data.
Some might interject and say that they have some augmentation routine for their data. That’s great – but did you formalize it?
What about building a prototype before having seen the data – something like a modeling sanity check? Simulate some data and build a prototype before you invest resources in gathering data and fitting insufficient models. This was already pointed out by Andrew Gelman in his Keynote at the NY PyData Keynote 2017.
Get better intuition and parameter insights! For deep-learning models you need to rely on a platitude of tools and plotting libraries to explain what your model has learned.
For probabilistic approaches you can get insights on parameters quickly.
So what tools do we want to use in a production environment?

I. STAN – The Statisticians Choice

STAN is a well established framework and tool for research. Strictly speaking, this framework has its own probabilistic language and the Stan-code looks more like a statistical formulation of the the model you are fitting.
Once you have built and done inference with your model you save everything to file, which brings the great advantage that everything is reproducible.
STAN is well supported in R through RStan, Python with PyStan, and other interfaces.
In the background the framework compiles the model into efficient C++ code.
In the end, the computation is done through MCMC Inference (e.g. NUTS sampler) which is easily accessible and even Variational Inference is supported.
If you want to get started with this Bayesian approach we recommend the case-studies.

II. Pyro – The Programming Approach

My personal favorite tool for deep probabilistic models is Pyro. This language was developed and is maintained by the Uber Engineering division. The framework is backed by PyTorch so that the modeling that you are doing integrates seamlessly with the PyTorch models which you might already have.
Writing your models and training writes like any other Python code with some special rules and formulations that come with the probabilistic approach.

As an overview we have already compared STAN and Pyro Modeling on a small problem-set in a previous post: .

Pyro excels, when you want to find randomly distributed parameters, sample data and perform efficient inference.
As this language is under constant development not everything you are working on might be documented. There are a lot of use-cases and already existing model-implementations and examples. Also the documentation gets better by the day.
The examples and tutorials are a good place to start, especially when you are new to the field of probabilistic Programming and statistical modeling.

III. TensorFlow Probability – Google’s Favorite

When you talk Machine Learning, especially deep learning, many people think TensorFlow. Since TensorFlow is backed by Google developers you can be certain, that it is well maintained and has excellent documentation.
When you have TensorFlow or better yet TF2 in your workflows already you are all set to use TF Probability also.
Josh Dillon made an excellent case why probabilistic modeling is worth the learning curve and why you should consider TensorFlow Probability at the Tensorflow Dev Summit 2019:

TensorFlow Probability: Learning with confidence (TF Dev Summit ’19) by TensorFlow Channel

And here is a short Notebook to get you started on writing Tensorflow Probability Models:

Honorable Mentions

PyMC3 is an openly available python probabilistic modeling API. It has vast application in research, has great community support and you can find a number of talks on probabilistic modeling on youtube to get you started.

If you are programming Julia, take a look at Gen. This is also openly available and in very early stages. So documentation is still lacking and things might break. Anyhow it appears to be an exciting framework. If you are happy to experiment, the publications and talks so far have been very promising.


[1] Paul-Christian Bürkner. brms: An R Package for Bayesian Multilevel Models Using Stan
[2] B. Carpenter, A. Gelman, et al. STAN: A Probabilistic Programming Language
[3] E. Bingham, J. Chen, et al. Pyro: Deep Universal Probabilistic Programming

Single-Parameter Models | Pyro vs. STAN

Photo by Joey Csunyo on Unsplash

Modeling U.S. cancer-death rates with two Bayesian approaches: MCMC in STAN and SVI in Pyro.

Single parameter models are an excellent way to get started with the topic of probabilistic modeling. These models comprise of one parameter that influences our observation and which we can infer from the given data. In this article we look at the performance and compare two well established frameworks – the Statistical Language STAN and the Pyro Probabilistic Programming Language.

Kidney Cancer Data

One old and established dataset is the cases of kidney cancer in the U.S. from 1980-1989, which is available here (see [1]). Given are U.S. counties, their total population and the cases of reported cancer-deaths.
Our task is to infer the rate of death from the given data in a Bayesian way.
An elaborate walk-through of the task can be found in section 2.8 of “Bayesian Data Analysis 3” [1].
Our dataframe looks like this:

We have the total number of datapoints N (all rows in the dataframe) and per county we have the observed death-count (dc) which we refer to as y and the population (pop), which we call n later.
Given that we have epidemiological data, we think that the Poisson distribution gives a good basis for estimating the rate that we want to compute. Therefore our observations y are sampled from a Poisson distribution. The interesting parameter is the \lambda for our Poisson, which we call rate. This death-rate comes from a Gamma distribution (see explanations in Chapter 2 [1]). A very short explanation why we use a Gamma here is that it is a conjugate prior with respect to Poisson.
Though there is more to the special relationship that the distributions have.

Putting it together we arrive at the formulation for the observations of death-cases:

and for the wanted single parameter rate:

Now that we have seen the tasks lets fire up our tools and get to work.

STAN – the classical statistical models

The statistician’s well established work-horse is the stats language STAN.
You write your problem as model-code and STAN will compile an efficient C++ model under the hood. From a compiled model you can sample and perform inference.
The Framework offers different interfaces for example in Python (PyStan), R (RStan), Julia and others. For this article we will use PyStan so that we can have both models neatly side by side in a notebook.
We start to define the data that we have given in STAN. This means the integer N for the total size of the data set, y the observed deaths and n the population per county. So that we have each N times.

data {
   int N; // total observations
   int y[N]; // observed death-count
   vector[N] n; // population

We already know that a rate parameter like the death-rate \theta is bounded in [0, 1]. That means you cannot have a rate less than 0 or higher than 1.
This \theta parameter is then basis for the actual rate – that accounts for a counties population. Thus, since we transform our parameter we put it into the transformed parameters bracket.:

parameters {
   vector<lower=0,upper=1>[N] theta; // bounded deathrate estimate

transformed parameters {
   vector[N] rate=n .* theta;

Now to the magical inference part, which is what the posterior is all about. We sample our \theta from a Gamma distribution. The \alpha (=20) and \beta (=430000) are given in [1], but one could easily compute them from the underlying dataset. Notice, that the model actually takes the transformed parameter and not just the \theta. This will become apparent in the output later.

model {
   theta ~ gamma(20,430000);
   y ~ poisson(rate);

Now that we have a STAN Model as a string, we need the data in the right format. That means arrays of integers from the data, like so:

dc = data["dc"].to_numpy().astype(int)
pop = data['pop'].to_numpy().astype(int)
kidney_cancer_dat = {'N': N,
                    'y': dc[0:N],
                    'n': pop[0:N]}

Since everything is set-up we can use PyStan to convert our Model to C++ code and use sampling to perform inference. The inference procedure is Markov Chain Monte Carlo or MCMC. The necessary parameters are the amount of samples that we take (the iterations) and the amount of chains over which we draw the samples. Each chain is an ordered sequence of draws.

sm = pystan.StanModel(model_code=kidney_cancer_code)
fit = sm.sampling(data=kidney_cancer_dat, iter=5000, chains=4)

After the fitting is done, the first thing to check is if the model has converged. That means that \hat{R} is >= 1 or we can just ask the diagnostics tools:


### the below lines are output:
n_eff / iter looks reasonable for all parameters
Rhat looks reasonable for all parameters
0.0 of 10000 iterations ended with a divergence (0.0%)
0 of 10000 iterations saturated the maximum tree depth of 10 (0.0%)
E-BFMI indicated no pathological behavior

We can also see that the chains have converged (right plots) from the traces of the performed inference:

Fig. 1 – Traceplot of the \theta posterior computation. The chains on the right side display convergence. Each chain has a density function for theta (left side), which display agreement over the values.

The chains (Fig. 1 right) should look like “caterpillars madly in love” and your diagnostics look good. So we know that our model has converged. We can now see the inferred theta values and can look at the individual densities:

az.plot_density(fit, var_names=["theta"])
Fig. 2 – Posterior densities on the fitted single parameter \theta after inference.
Warning: Do not use the complete dataset as input! This will lead to errors,
we went with N=25 samples for testing purposes. Feel free to test how many samples it takes to break PyStan (or STAN).

For doing Bayesian regression task there are also different modules that let you skip the “write-an-elaborate-statistical-model” part. BRMS in R is one of those excellent tools.

Pyro – the probabilistic programming approach

Going to a completely different paradigm is Pyro. Instead of performing MCMC through sampling we treat our task as an optimization problem.
For this we formulate a model, which computes our posterior (p). Additionally we also formulate a so-called guide which gives us a parameterized distribution (q), which is used for the model-fitting procedure.
In a previous SVI article we had already looked at how this optimization works and recommend a short recap.

The highly unintuitive part to this is that instead of simply fitting one parameter we comprise it of four Pyro parameters.:

  1. Firstly, \alpha and \beta are constants that give our Gamma distribution its shape like we did in the STAN part,
  2. We add two boring trainable parameters p1 and p2, which get allow for optimization in the SVI steps. Both parameters are positive – hence constraint=constraints.positive ,
  3. When going through the data, the observations themselves are independent events.
    This is done through Pyro’s plate modeling, which also supports subsampling from our dataset. A nice introduction to this can be found in [2].
ϵ = 10e-3
def model(population, deathcount):
    α = pyro.param('α', torch.tensor(20.))
    β = pyro.param('β', torch.tensor(430000.))
    p1= pyro.param('p1', torch.ones(data.shape[0]), constraint=constraints.positive)
    p2 = pyro.param('p2', torch.ones(data.shape[0]), constraint=constraints.positive)
    with pyro.plate('data', data.shape[0], subsample_size=32) as idx:
        n_j = population[idx]
        y_j = deathcount[idx]
        α_j = α + p1[idx]*y_j
        β_j = β + p2[idx]*n_j
        θ_j = pyro.sample("θ", Gamma(α_j, β_j))
        λ_j = 10*n_j*θ_j + ϵ
        pyro.sample('obs', Poisson(λ_j), obs=y_j)

The model performs the sampling of the observations given the Poisson distribution. This is comparable to what STAN was doing, with the difference that the parameters which make up \lambda and the underlying distributions are now trainable Pyro parameters.
In order for our model to not go up in flames we have to add a small number \epsilon to the rate otherwise Poisson will be unstable:

λ_j = 10*n_j*θ_j + ϵ
This is not the best way to model this task, but it is closest to the STAN model. One can find a model and guide that do not rely on y and n for computing alpha and \beta.

Now the guide gives us a parameterized distribution q with which we can perform optimization on minimizing the Evidence Lower bound or ELBO. The parameters are trainable Pyro parameters, which we have already seen in the model.

def guide(population, deathcount):
    α = pyro.param('α', torch.tensor(20.))
    β = pyro.param('β', torch.tensor(430000.))
    p1 = pyro.param('p1', torch.ones(data.shape[0]), constraint=constraints.positive)
    p2 = pyro.param('p2', torch.ones(data.shape[0]), constraint=constraints.positive)
    with pyro.plate('data', data.shape[0], subsample_size=32) as idx:
        n_j = population[idx]
        y_j = deathcount[idx]
        α_j = α + p1[idx]*y_j
        β_j = β + p2[idx]*n_j
        θ_j = pyro.sample("θ", Gamma(α_j, β_j))

Now we can run our SVI like so:

svi = SVI(model, guide, Adam({'lr': 0.025}), JitTrace_ELBO(3))
for i in tqdm(range(1000)):
    loss = svi.step(population, deathcount)
    print(f"Loss: {loss}")

When we plot the loss, we can see that the model improves over time. The lower the bound the better. The loss is the ELBO trace over the iterations.

Fig. 3 – SVI loss over iteration. Loss is the ELBO returned from the SVI step over 1000 iterations.

Now for the final check we can see that we have fitted appropriate parameters.
The \alpha and \beta variables make up for a good underlying Gamma distribution for our posterior inference.

Conclusion – What approach is best?

STAN has a straight-forward way to reason about the model. If the mathematical formulation of a posterior is known, it can be very straight forward to implement a model.
However STAN and its MCMC sampling have their limitations. Under default configuration it was not possible to run our model over all of the data.
Pyro does an excellent job at handling larger datasets efficiently and performing Variational Inference. As a Probabilistic Programming Language it can be written just like any other Python code. The model plus guide are also informative and encapsulate the problem well. Through this we transform the posterior computation into an optimization task and get sensible outputs.

Now that you are familiar with single-parameter models have fun working with bigger, more complex tasks. The Pyro Examples are a good place to start.

Happy inferring!


[1] A. Gelman, J.B. Carlin, et. al., Bayesian Data Analysis . Third Edition.
[2] Pyro Documentation – SVI part II

Your 10.000 hours in Data Science | A Walk down Data-Lane

The way from data novice to professional.

Clock with reverse numeral, Jewish Townhall Clock, Prague- by Richard Michael (all rights reserved).

There exists the idea that practicing something for over 10000 h (ten-thousand-hours) lets you acquire enough proficiency with the subject. The concept is based on the book Outliers by M. Gladwell. The mentioned 10k hours are how much time you spend practicing or studying a subject until you have a firm grasp and can be called proficient. Though this amount of hours is somewhat arbitrary, we will take a look on how those many hours can be spent to gain proficiency in the field of Data Science.
Imagine this as a learning budget in your Data-apprenticeship journey. If I were to start from scratch, this is how I would spend those 10 thousand hours to become a proficient Data Scientist.

The Core modules

Mathematics and Statistics: Basic (frequentist) statistics, Bayesian statistics, Intro to Data Analysis, some Probability Theory, some basic Analysis and Calculus.
A Data Scientists main job is to provide insights into data from problem domains. To have a firm grasp of the underlying mathematics is essential. You can find a lot of excellent content from available university courses and Coursera courses. This is a good way to get started on your journey and get a rudimentary understanding of how statistics work. Though, fundamental courses are essential, please challenge yourself with the real deal.
Lets give that 2500h (312 days of 8h work).
Analysis/Stats Languages and modules: R, Python (pandas), Julia, SAS.
This is the bread-and-butter for you and can be considered a part of your stats-education. Here it is about learning by doing. Reading a O’Reilly book on R will only get you so far. Load up a curated data-set, some kaggle challenge and work through some problems.
There is a spectrum on which statistical languages lie. So if you’re close to scientific computing you might consider Julia. The other end of the spectrum with just statistical analysis is SAS. Some people argue that R can do both.
1000h (125 days of 8h work)

Multi-purpose programming languages: Python, Go, Java, Bash, Perl, C++,… .
This very much depends on the systems that you are facing on a daily basis. If you just start out, pick a language and stick to it. Once you learn the concept you will pick up different languages easier.
I for myself rely heavily on a combination of Python and Bash for my daily work. Other tasks require a thorough understanding of the good old Java or even Perl to get started.
2000h (250 days of 8h work).

Database Technologies: {T-, My-, U-}SQL, PostgreSQL, MongoDB, … .
Relational or non-relational databases are some of the systems that you will have to work with in a production environment. It is useful to know, how your data is stored, how queries run under the hood, how to reverse a transaction. For later work your data-sources might vary a lot and it is good to have a basic understanding.
750h (94 days of 8h work)

Operating Systems: Windows, Linux, MacOS.
Whatever your work-environment is: master it! Know the ins-and outs. You might need to install some really weird library this one time to solve a specific problem (true story). Knowing where things are and why they are there goes a long way. Know what an SSH connection is and how to run analysis on a different system. Running analysis not on your local machine is going to happen at some point in the future.
500h (62 days of 8h work) .

ETL/ELT Systems: This is a mixture between programming languages, database technologies and operating systems. Frameworks like Spark, Hadoop, Hive offer a advanced means for data-storage and also analysis on cluster computing platforms. Some companies may rely on a different tech-stacks like Microsoft Azure Systems, Google Cloud or AWS solutions.
This goes hand in hand with Database Technologies and might already require a firm understanding of higher level programming languages (like Java or Python). There are also beginner systems, like KNIME to get your feet wet .
400 (50 days of 8h)

Your Problem Cases: This may be Fin-Tech, Bio-Tech, Business Analytics
You should be familiar with the problem domain in which you are working. Is it research and development that you are doing? Are you visualizing business processes and customer behavior? Different field require different insights. Your data has to make sense to you and you should be able to see if a model-output is in a valid range or totally off. Spend a very absolute minimum of 350h in your problem domain. We suggest more. A lot more.
You can decide if you want to be the method jack-of-all-trades or the expert in your field.
> 350h (44 days of 8h)

The attentive reader sees that this only adds up to 7500 hours so far. We have a basis now and you might want to go into a certain direction from here.

Different Data-trades – Your personal direction

Back in the olden days, the dark-unenlightened ages, we had guilds. The field of working with data also has different guilds. Types of tradesmen that solve different problems and there have been different articles on the subject.
Here is how those trades differ in their requirements:

  1. The Data Scientist: You do stats and analysis, you roll-out solutions and deploy platforms that answer questions and provide insights. Neat. Up your statistics game: + 200h. Maybe a bit of ML which we call statistical learning +100h and some type of multi-purpose-language like Python, Perl or Bash for your dayjob.
  2. The Data Engineer: You care about the data. You build the systems that enable the Data Scientist to work effectively and give the analysts their daily bread. Your focus lies on the underlying systems. +500h on Systems, +200h on ETL/ELT systems, +200h on DBs and programming languages +200h. You can even afford to not be so heavily involved in the statistics part (-500h on pure stats and theory).
  3. The Machine Learning Engineer: You implement the models that make the magic happen.
    You need all the statistical insights and also proficiency with all the ML on top of things –
    +300h on ML, + 200h on Stats. You should also be well versed in a high level programming language which makes implementing ML-models possible in the first place.
  4. The Analyst: You take data from the systems, make it pretty and reportable for management. You need to know what matters in your problem domain +200h in problem cases, +100h in DB systems. You need your SQL daily. Though there is a big variety of analyst jobs out there. So inform yourself and continue learning what you see is necessary.
Fig. 1 – Time and experience from a Data Scientist compared to an analyst’s job..
Fig. 2 – Time and experience of an ML-Engineer compared to a Data-Engineer.
Math/Stats = Mathematics and Statistics, includes your work with stat-languages like R.
Systems = Setting up and maintaining systems (e.g. ETL/ELT)
DBs = Database Systems (relational, non-relational)
SWE = Software Engineering (Python, Perl, Java, Scala, Go, JavaScript, …)
Exp. Field = Experience in the problem domain. Hours spent on the topic.

The Figures above illustrate exemplary careers and how much time and effort can be placed in each domain. The figures make suggestions and vary from individual to individual. For example the Data-Engineer in our example could also be called a Software Engineer that handles a lot of data and another engineer would need more experience working with databases compared to programming languages.

As with everything in life the lines are blurred. Sometimes a Data Scientist is also does the work of a ML Engineer and vice versa. Sometimes a Data Scientist also has to do Analytics Dashboards. Stranger things have happened in the past. Keep in mind that the 10k budget is your entry and gives you a direction into the field and not your specialization. So you have all the freedom in the world to eventually do that PhD in Parapsychology that you’ve always wanted.

How much work is ten thousand hours

10k hours is a long time. if you were to work for one year that would be 27h per day and not so feasible for the average mortal.
So better 3 years of work for around 9 h might be better suited.
If you’re studying on the side and look for a change in your job at some point in the future, then you might be set after a couple more years. The important part is to be consistent with it. A couple of quality hours a day over a few years will get you a long way.

Personal Note

My journey started in October 2016 coming from systems administration and starting a new undergraduate program. Now that it is July 2020 it has been around 1360 days of gaining experience in the field. Therefore I am in the field of >7500 h (with over 5.5h every day) and still continue to study and learn a lot every day. My undergraduate program in Cognitive Science helped me get a head-start in the fields of computer science, applied mathematics, Bayesian statistics, DB systems, formal logic, Machine Learning courses and much more. This together with working as a Data Engineer part-time helped me gain a lot of insights really quickly.

The Key element

I don’t care if your do your 10 thousand hours at Uni, in school, at seminars or at home.
Nobody cares. You have to show your work. A degree, if done right, can show this – but it doesn’t have to. It is about gaining experience in the field by solving problems and gaining proficiency with concepts.
Solve a problem and put it in a portfolio, your resume or simply your github-repository.
Whatever works best for you. People hire for your problem solving skills. Not for the A that you got on that one project 2 years ago. You might say that it was a really cool project altogether.
That’s great! if you made a dent in the universe you should tell people about it anyways.

The journey is a long one and if you enjoy what you do it is worthwhile. You will learn a lot along the way.

A short Disclaimer: Always keep in mind that there are different fields of Data Science related work out in the wild.
Every job has its specific purpose and requires a different tool-sets.
Every employer might look for different sets of skills.

Compute the Incomputable | What SVI and ELBO do

One reason why Bayesian Modeling works with real world data. The approximate light-house in the sea of randomness.

Photo by William Bout on Unsplash

When you want to gain more insights into your data you rely on programming frameworks that allow you to interact with probabilities. All you have in the beginning is a collection of data-points. It is just a glimpse into the underlying distribution from which your data comes. However, you do not just want to get simple data-points out in the end. What you want is elaborate, talkative density distributions with which you can perform tests. For this you use probabilistic frameworks like TensorFlow Probability, Pyro or STAN to compute posteriors of probabilities.
As we will see, the plain computation is not always feasible and we rely on Markov Chain Monte Carlo (MCMC) methods or Stochastic Variational Inference (SVI) to solve those problems. Especially over large data-sets or even every-day medium sized datasets we have to perform the magic of sampling and inference to compute values and fit models. If these methods would not exist we would be stuck with a neat model that we had thought up, but no way to know if it actually makes sense.
We have built such a model in a previous article on forecasting .
In this example we will look at inference and sampling from a Pyro point of view.

In short, instead of brute-forcing the computation Pyro makes an estimate. We can sample from that estimate and we can then compare that estimate to the actual data given our model. Then we make improvements – step by step. But why can’t we just compute the whole thing? I’m glad you asked.

From infinity to constant

Lets say we have a dataset D. We want to build a model (\theta) that accurately describes our data. For that we need a set of variables. These variables correspond to events that are random and we call them latent random variables e.g. z.
Now to see how well our model does we rely on posterior probability distribution P_{\theta}. If this is close to the real deal the likelihood of this posterior, compared to the observed data will be low. That is a good starting point, now how do I compute this P? In order for this to work, we have to do inference like this:


The math-savvy reader already sees that the right part of the equation above is in general not computable. We face the terror of the almighty integral over something we don’t know yet and given a continuous variable might be infinite. That is quite big.
What we can do instead is use a variational distribution; lets call it Q.
In an ideal world this approximate Q would be as close as possible to our P which we are trying to compute. So how do we get there?
One thing that we need first is a measure of how similar P and Q are. Enter the Kullback Leibler Divergence — you can find the legendary paper by Kullback and Leibler in the Annals of Mathematical Statistics [1].
Our tool to compare the distributions, the Kullback Leibler Divergence is
defined as :

    \begin{align*}KL(Q(z|x)||P(z|x)) = E_q\log\frac{Q(z|x)}{P(z|x)}\end{align}

This measure tells us how similar the distributions P and Q are. If they are similar, the KL is low and if they are very different, the divergence is large.

Now this measure requires us to know both distributions P and Q. However we don’t know P yet. After all, that is what we are computing and so far our Q is a wild guess.
After some reformulation we end up with this (see [5] for a step-by-step solution):

    \begin{align*}KL(Q(z|x)||P(z|x))=-E_q[\log P(z,x)-\log Q(z|x)] \\ + \log P(x)\end{align}

The first part: -E_q[\log P(z,x)-\log Q(z|x)] is what is called the evidence lower bound or ELBO. The second part: \log P(x) is a constant. That is much better. All we have to do now is maximize the ELBO to get minimal divergence between our P and Q.

Minimizing KL divergece corresponds to maximizing ELBO and vice versa.

Learn Q with Deep-Learning.

Now that we have the basic theory laid out, what do we do about our Q?
We can learn this estimate by using some form of neural network. We already have TensorFlow or PyTorch in the case of Pyro at our disposal. Compared to the model we can build something that models our variational distribution. As soon as we have that we can use stochastic gradient descent to optimize and get our model. So lets fire away.

We use our data as input and get Q as an output from the network. If we have this as a Gaussian then the network gives us a mean \mu and standard deviation \sigma.

If we picture this process in a very simple form, rolling out over time it looks something like this:

Fig. 1 – Function of model and guide when computing the posterior.

Over all our data we then observe various random events which we compare to what we have sampled at that point in time. We try to minimize the divergence – which means to maximize the evidence lower bound. At the last point in time the ELBO is maximized or the divergence is minimal. We then have fitted our model.

Talk Pyro to me

Lets take a simple example case in which we use a fit a model-function f, given the latent random variable z which comes from a \beta-distribution. The events we are observing are happening Bernoulli-random. This means that the event either occurs or it doesn’t. Lastly the events are i.i.d. – that is independent and identically distributed which we write as the `pyro.plate` notation. So that our model is:

 def model(self):
        # sample `z` from the beta prior
        f = pyro.sample("z", dist.Beta(10, 10))
        # plate notion for conditionally independent events given f
        with pyro.plate("data"):
            # observe all datapoints using the bernoulli distribution
            pyro.sample("obs", dist.Bernoulli(f),

So far this is standard Pyro and in our forecasting model we did not specify a guide even though it was still there. You can take full control of your model process, by specifying the guide as such. In most cases this just looks like your model only slightly different. In this guide the latent z comes from this:

 def guide(self):
        # register the two variational parameters with pyro
        alpha_q = pyro.param("alpha_q", torch.tensor(15),
        beta_q = pyro.param("beta_q", torch.tensor(15),
        pyro.sample("latent_z", NonreparameterizedBeta(alpha_q, beta_q))

Lastly we run our inference by using the SVI object. We provide the previously specified model and the specified guide. Also we rely on the standard Adam optimizer for stochastic gradient descent.

def inference(self):
        # clear everything that might still be around
        # setup the optimizer and the inference algorithm
        optimizer = optim.Adam({"lr": .0005, "betas": (0.93, 0.999)})
        svi = SVI(self.model,, optimizer, loss=TraceGraph_ELBO())

You’re all set. Now lets look at some output. For this to take some shape I have used a simple Bayesian regression task and compute the posterior for the model-function (see my github for the specifics).
We utilize Pyro’s SVI but also a Markov Chain Monte Carlo procedure with a NUTS sampler.

The Ugly Truth | Comparing SVI to MCMC

So far everything sounded very promising – almost too good to be true. With everything in life you should always check your results. This example below (Fig. 2-3) shows the fitted posterior distribution of the previously mentioned regression-task using SVI and MCMC side by side.

Fig. 2 – Bayesian Regression of a function f with two variables using SVI and MCMC.
Fig. 3 – Density distributions overlay of a Bayesian regression task with posterior distribution after model-fitting.

The in-depth understanding of how and why Markov methods and inference differ is topic of various research. As this is a deep-dive it will be the topic for another article. The only thing that we can conclude at this point is that the sampling routine has an evident impact on our final model.

What do I do with SVI

We have looked a seemingly unfeasible problem in the face and taken a look at the theory of the underlying the inference process to figure out why this works. You can use deep-learning frameworks like TF and PyTorch to build your probabilistic models by approximating another model-distribution and use MCMC or SVI for this. The goal of it all is to minimize divergence (or maximize the lower bound) between our two approximated posteriors.
We should always look at our output and ask ourselves if what we see makes sense. The sampling procedure apparently influences our posterior outcome – so we should be careful.
All in all SVI allows us to take any data-set and perform Bayesian Modeling with it. If you ask me, that is a pretty nice feature.


  1. S. Kullback and R.Leibler. On Information and Sufficiency. 1951 in Annals of Mathematical Statistics.
  2. D. Wingate and T. Weber. Automated Variational Inference in Probabilistic Programming. 2013 on arxiv.
  3. Pyro Documentation – SVI tutorial I
  4. Pyro Documentation – SVI tutorial II
  5. Pyro Documentation – SVI tutorial III
  6. Reza Babanezhad. Stochastic Variational Inference. UBC
  7. D. Madras Tutorial Stochastic Variational Inference. University of Toronto. 2017

The Complete Pyro Example

The metioned example simply serves as a rough overview of how a Pyro class is set up. The functional code can be found in the Pyro documentation (see [5]).

class BetaExample:
    def __init__(self, data):
        # ... specify your model needs here ...
        # the dataset needs to be a torch vector = data

    def model(self):
        # sample `z` from the beta prior
        f = pyro.sample("z", dist.Beta(10, 10))
        # plate notion for conditionally independent events given f
        with pyro.plate("data"):
            # observe all datapoints using the bernoulli likelihood
            pyro.sample("obs", dist.Bernoulli(f),

    def guide(self):
        # register the two variational parameters with pyro
        alpha_q = pyro.param("alpha_q", torch.tensor(15),
        beta_q = pyro.param("beta_q", torch.tensor(15),
        pyro.sample("latent_z", NonreparameterizedBeta(alpha_q, beta_q))

    def inference(self):
        # clear the param store
        # setup the optimizer and the inference algorithm
        optimizer = optim.Adam({"lr": .0005, "betas": (0.93, 0.999)})
        svi = SVI(self.model,, optimizer, loss=TraceGraph_ELBO())

How your model is optimized | Know your Optimization

The answer to: “Why is my model running forever?” or the classic: “I think it might have converged?”

The driver behind a lot of models that the average Data Scientist or ML-engineer uses daily, relies on numerical optimization methods. Studying the optimization and performance on different functions helps to gain a better understanding of how the process works.
The challenge we face on a daily basis is that someone gives us a model of how they think the world or their problem works. Now, you as a Data Scientist have to find the optimal solution to the problem. For example you look at an energy-function and want to find the absolute, global minimum for your tool to work or your protein to be stable. Maybe you have modeled user-data and you want to find the ideal customer given all your input-features — hopefully continuous ones.
Here is a comprehensive summary of how an optimizer works under the hood:

Benchmark Functions – the curved, the bent and the asymmetric

Before we meet our optimizers we have to get acquainted with the functions that we intend to optimize. My three personal favorites are:

  1. Mr. Nice-Guy is the ellipsoid function. It is overall nicely behaved, convex and has a global minimum.
    it is defined as f_1(x)=\sum_{i=1}^d\alpha\frac{i-1}{d-1}(x_i)^2
  2. The Rosenbrock or also banana-function. It is not as nice as our f_1 – through its valley of-death it can be a challenge to optimization algorithms. It is defined as f_2(x)=(1-x_1)^2+100*(x_2-x_1^2)^2.
  3. The twisted-one is our attractive-sector function. It is the most unusual of the selection as it is non-convex and asymmetric around the minimum. It is defined as f_3(x)=\sum_{i=1}^d h(x_i)^2+100(-x_i)^2 where h(x)=\frac{\log(1+\exp(qx))}{q}.

This is comprehensible selection of functions and there exists a multitude of available benchmark-functions to run your optimizer across. From the depths of the Styblinski-Tang function to the beautiful peeks of Rastrigin. Some of which are ready made and implemented like in [1].

As you might have noticed the functions work on R^n and we provide an input vector \textbf{x}\in R^n. We will only be looking at two dimensions R^2 and three dimensions R^3 to make the display easier.

Now that we have set-up a playground we can just load up our tool of choice like scipy with its optimization library. Our goal is to capture how the optimizer finds the minimum on the benchmark function given different starting points. We just pick and choose any optimizers like simple gradient descent “CG” or the more intricate BFGS. It goes like this:

from scipy.optimize import minimize

# import benchmark function with its derivatives
from scipy.optimize import rosen
from scipy.optimize import rosen_der
from scipy.optimize import rosen_hess

import numpy as np

def banana(x, y):
a nice 2D way to look at Rosenbrock 
    return (1 - x)**2 + (100 * (y - (x**2))**2)

# we select some interesting optimizers:
OPTIMIZERS = ['CG', 'BFGS', 'Newton-CG', 'dogleg']
FUNCTIONS = {'rosen': [rosen, rosen_der, rosen_hess, banana]
# feel free to add or implement ellipsoid or other benchmark functions here with their derivatives

start_values = np.random.uniform(low=x_min, high=x_max, size=(10, 2))

start_iter = 1
max_iter = 50
step = 1
x_min, x_max = -5, 5
y_min, y_max = -5, 5

def optimize_funct(fname, funct, x0, derivative, hess, optimizer, iteration):
Just a wrapper around the scipy minimize function
  fitted_min =  minimize(funct, x0=x0, method=optimizer, 
                         jac=derivative, hess=hess,
  return fitted_min

# run actual optimization
fitted_optimizers = {}
for opt in OPTIMIZERS:
  fitted_optimizers[opt] = {}
  for name, functs in FUNCTIONS.items():
    fitted_optimizers[opt][name] = []
    f = functs[0]
    fd = functs[1]
    fdd = functs[2]
    f_2d = functs[3]
    for vals in start_values:
      computed_values = []
      x, y = vals
      z = f_2d(x,y)
      # include start values before optimization
      for i in range(start_iter, max_iter, step):
        out = optimize_funct(fname=name, funct=f, x0=vals, derivative=fd, hess=fdd,
                            optimizer=opt, iteration=i)
        # only save the output values (stored under x)
        x, y = out.x
        z = f_2d(x, y)
        computed_values.append(np.array([x, y, z]))

Not all optimizers are created equal. Some require the first and some the second derivative for our optimization functions aka the Jakobian and the Hessian . For an extra brief recap of what those are and how they work you can look here.
One thing to keep in mind is, that we start at different starting positions and let the optimizer roam freely – we are not doing constraint optimization just yet.

Fig. 2 – Contour plot of the Ellipsoid function from different starting points using standard gradient descend. Each step is a “+”. The optimizer finds the global minimum after a few steps.
Fig. 3 – Contour plot of the Rosenbrock- (Banana-) function. Gradient descent first walks into the right direction (see black “+”), but it looks for the global minimum to no avail. The valley it traverses is already low in function-value and information on where to look next is not as informative.
Fig 4. Attr. Sector contour plot of the function. BFGS optimization on the function space. Given three starting points (three colors) each step is depicted by a black cross. The minimum is quickly discovered and we see how a linear method (CG) behaves differently compared to BFGS which is from the family of quasi-newton methods.

Personally it helped me a lot to visualize how the optimization procedures move through the different function-spaces. Here you can see how CG and BFGS perform. Now CG is scipy’s implementation of gradient descent . It is the plain and simple optimization procedure and a classical line-search method. BFGS on the other hand uses approximated higher order information. BFGS therefore classifies as a quasi-Newton method.

A quasi-Newton method approximates higher order information like the Hessian Matrix (second order derivatives) and makes improvements per each step taken. [2]

Line-Searching | Backtrack Everything.

One method of optimization is simply following a line given the underlying function (see Fig. 2 and Fig. 3). We keep some information through backtracking, but that is about it. In order to optimize we may utilize first derivative information of the function. An intuitive formulation of line search optimization with backtracking is:

  1. Compute gradient at your point
  2. Compute the step based on your gradient and step-size
  3. Take a step in the optimizing direction
  4. Adjust the stepsize by a previously defined factor e.g. \alpha
  5. repeat until minimum is found or the difference in value between your steps is very (very) small – e.g. 10^{-8}
Gradients give us information on how the slope of a function behaves. We take the gradient as the first-order-derivative or partial derivative of a function. It is depicted as a vector and defined as \nabla f(x)=[\frac{\delta f}{\delta x_1}\ldots\frac{\delta f}{\delta x_n}]^T. 

Trust-Region Optimizers | Thinking in Circles

Instead of just looking straight ahead and straight back we can also look around us in a 360 degree fashion. We look at our function-information like the Jacobian and Hessian and make an optimal step in the right direction. In order for this to work we consider the quadrativ model function m_k(p)=f_k+g_k^Tp+\frac{1}{2}(p^TBp) where g is the gradient of f and B the Hessian.
When we do the optimization we optimize for p and adjust the radius that is around us appropriately. This looks like:

Fig. 5 Trust-Region Optimization on the contour of the attr.-sector function. Each step is depicted as “+” and around each step is the region from which the next step is chosen as a circle.

An intuitive formulation of the procedure can look like this
(see the Appendix for a formal representation):

  1. Compute Hessian and Gradient of your function
  2. Choose a point in a radius around your position that is optimal
  3. Take a step given the computed information
  4. Adjust the size of the radius given on how much or how little you have improved
  5. Repeat until convergence or tolerance is reached

Of course these short steps don’t quite capture the complexity, you need more parameters and a bit more math to find the right point in your radius, adjust the step and the radius if that is interesting to you see Alg. 1 and 2 in the Appendix.

We can observe an algorithm in practice and look at how the radius changes over time:

Fig. 6 – magnitude changes of function values (y axis) over steps taken (x axis). The radius \Delta changes with the improvement of the function-values.

We see that if the function values improve over time then our radius gets smaller (see Fig. 5 and 6) whereas if the values are bad initially we increase our radius. See the first steps of the optimization on the attr. sector function in Fig. 6.

Conclusion | The Rabbit Hole

We have seen how two families of optimizers work: the straight-lined linear optimizer and the trust-region optimizer. This is only the entrance of the rabbit hole. With each function and problem we look at we can discover something new. We can analyze how our Jacobian or Hessian matrix have to behave to make our optimizer do its work. We have to look at the shape and properties of our functions.
Obviously there exists a vast number of available proofs and theorems that are the mathematical backbone of what we have just looked at.
If you want to dive deep and get your hands on the raw mathematics I recommend going through the book Numerical Optimization (2nd ed.) by J. Nocedal and S. J. Wright [3].


Much of the work that I have shown here I had done in attending the Numerical Optimization course at KU – DIKU. Without the guidance of the course-officials this would not have been possible.


[1] MathWorks Test functions for global optimization algorithms
[2] C. Geiger and C. Kanzow . Quasi-Newton-Verfahren Springer 1999
[3] J. Nocedal and S. J. Wright Numerical Optimization (2nd ed.) Springer 2006


Trust Region Algorithm

Algorithm for the region of trust optimization in Fig. 5
Alg. to solve for p – solving for lambda is another subproblem.