Millennial Suicides – an Analysis about Change

A simple, yet meaningful Pyro model to illustrate change-points over time.

Walking alone. A photo by Mitchel Lensink on Unsplash.

One profound claim and observations by the media is, that the rate of suicides for younger people in the UK have risen from the 1980s to the 2000s. You might find it generally on the news , in publications or it is just an accepted truth by the population. But how can you make this measurable?
In order to make this claim testable we look for data and find an overview of the suicide rates, specifically England and Wales, at the Office for National Statistics (UK) together with an overall visualization.

Making an assumption tangible

Generally, one type of essential questions to ask – in everyday life, business or academia – is when changes have occurred. You are under the assumption that something has fundamentally changed over time. In order to prove that you have to quantify it. So you get some data on the subject-matter and build a model that shows you when changes have occurred as well as their magnitude. We are going to look at exactly how to do that.

Millennial Suicides Specifically

Now that we have our data, we take the average for the age-group of millennials. That entails ages in the range of 25 to 35, for the sake of simplicity. Given our dataset, we can visualize the average suicide count for millennials since the 1980s like this:

suicide_millenials_df = suicide_data[suicide_data.Age.isin(range(23,35))].groupby(["Year"]).mean().reset_index()
# round values to closest integer
suicide_millenials_df["Suicides"] = suicide_millenials_df.Suicides.round()
Yearavrg. Suicide per 100.000
What the head of your dataframe should look like.
Fig. 1 – The average suicides (per 100.000 people) in the age group 25 to 35 for England and Wales from 1981 to 2017.

From this data we can clearly see a steady increase over time until the very early 2000s followed by a decline until the 2010s. The inquisitive side of us wants to know when exactly that happened so that we can go out and look for the root causes and the underlying issues associated with those trends. Lets build a model to find out exactly in which years to look for that substantial change. To look for a specific change-point we will use the Pyro PPL as a probabilistic framework.

The How

We will use the mathematical underpinnings that are described in [2]. This entails:
a uniform prior distribution over the years T , the rate of change \mu from a half-normal distribution (the positive side of the Gaussian) in inversion-bracket notation. Looking at the data we set a large scale for our rate as we want to capture the change from 75 to 130 – something around 50. Lastly we find the observed rate of suicides through a Poisson regression.
Note: The Poisson regression entails, that we will sample from a normal distribution.

T \sim \mathcal{U}(1981,2017), \mu_0\sim\mathcal{N}^+(0, 50), \mu_1\sim\mathcal{N}^+(0, 50), n_t\sim\text{Poisson}(\mu_{[t>T]})

These parameters can be adjusted – maybe you look for different scales of changes, either across time or effect-size. One might consider a half-Cauchy distribution for more extreme values, maybe an increase or decrease on the scale.

The Pyro Model

def model(years, suicides):
    σ = pyro.param('σ', torch.ones(data_len))
    T = pyro.sample('change', dist.Uniform(1981, 2017))
    grp = (years > T) * 1
    # Independent events 
    with pyro.plate('rate', 2):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50))
    y_obs = pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)

The grp gives us the index for our bracketed \mu .
The model observations are then sampled from the Normal distribution taking into account \mu and a scale=\sigma=1 (last line in the model).

Before we can run this model we convert our dataframe to Pyro readable tensors using PyTorch, like so:

years = torch.from_numpy(suicide_millenials_df.Year.values)
suicides = torch.from_numpy(suicide_millenials_df.Suicides.values)
data_len = len(suicide_millenials_df)

For the fitting procedure we perform Bayesian Inference utilizing MCMC methods. We sample for 1000 iterations and let the sampler burn-in up for 300 iterations.

SAMPLES = 1000
WARMUP = 300
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=SAMPLES, warmup_steps=WARMUP), suicides)

We specifically use the hands-off NUTS sampler to perform inference and find values for our Pyro parameters and recommend that you do the same.
Beware that in contrast to Variational inference approaches the MCMC takes its time. After training we find a relatively high acceptance probability of around 91.2% and a stepsize \epsilon of around 0.00198.
Checking the model summary we see that everything is in order- no divergences – maybe our R-hat values are even too high.

            mean       std    median      5.0%     95.0%     n_eff     r_hat
    change   1995.40      2.10   1995.40   1991.72   1998.24      3.04      1.94
      μ[0]     96.21     10.41     97.56     93.81    101.23     20.16      1.08
      μ[1]     87.62      2.59     87.90     83.01     90.94      3.03      1.95

Now with the samples at hand, we can display when a change-point has occurred most frequently over time:

Fig. 2 – Histogram of the occurrence of the change-point across years.

We go ahead and plot our different rates over the given time

def pl(pt):
    grp = (years > pt['change']) * 1
    line = plt.plot(years, pt['rate0'] * (1 - grp) + pt['rate1'] * grp, 
      color='red', alpha=0.005)    

df.apply(pl, axis=1)
Fig. 3 – Change-rate over time (red-line) for 1000 samples. There is a change around 1995 to 1996 to mark the downwards trend.
Fig. 4 – Another rate of change over time as the red line. We see over three events between 1986 to 1989 the change has manifested itself. This run took 2000 samples.

First we ran the previously described model with a 1000 samples. This uncovered the change-point around the year 1995 to indicate the upcoming downwards trend. The output can be seen in Fig.3 where the rate is indicated as a red-line. Here model has quantified the change-point for the better development.
We took another shot at running the model for which we took 2000 samples and the model converged on a time-range in the years of ’86 to ’89 for yet another change-point. This marks the potential turning-point for an upwards trend. If we want to uncover the cause behind an increase in suicides among Millennials we have to look for reasons in that time-range or earlier changes, of which the effects have manifested themselves into this time.

Open Questions

We have shown that there is not just one change happening over time. There is a range of points for an increase in the rate and a range of points for a decreasing rate. One might extend the model to pick up more than one change-point, making the \mu three dimensional or fitting an entirely different model.

Concluding the Model

We have shown how to get from data, to a concrete formulation of a change in time and their underlying probabilities. This was directly translated in Pyro and through sampling we could infer concrete change-points. Probabilistic Programming is a powerful engine.

The complete notebook can be found here:

The Gravity on the Subject

Though this was intended as an introductory piece on the topic of Data Science, the topic of self-harm and death is a grave one. In case you or a loved one are affected there is help and resources out there. So I will leave you with this message here:


  1. Office for National Statistics, UK. Middle-aged generation most likely to die by suicide and drug poisoning. Dataset on suicide-rates. 2019.
  2. C. Scherrer. Bayesian Changepoint Detection with PyMC3. 2018.
  3. M. Hoffman, A. Gelman. The No-U-Turn Sampler. JMLR v15. 2014.

Improve Your Model with Missing Data | Imputation with NumPyro

Make the best of missing data the Bayesian way. Improve model performance and make comparative benchmarks using Monte Carlo methods.

A Missing frame ready to throw you off your model. Photo by Vilmos Heim on Unsplash.

The Ugly Data

How do you handle missing data, gaps in your data-frames or noisy parameters?
You have spent hours at work, in the lab or in the wild to generate or curate a dataset given an interesting research question or hypothesis. Terribly enough, you find that some of the measurements for a parameter are missing!
Another case that might throw you off is unexpected noise that was introduced at some point in the experiment and has doomed some of your measurements to be extreme outliers.
Sometimes you cannot just exile a whole parameter from your analysis, simply because it has 25% missing entries or maybe entails a “does not apply” field in your questionnaire.

Fear not, there is a way you can make the most of your dataset anyways.
In this post we will use Probabilisitc Programming [5] to fill the gaps that your model might face.

Simulate Missing Data

Lets build a problem from scratch. The studious reader is most likely familiar with the Iris flower data-set. It is somewhat part of the “hello-world” in the ML community; next to the well-known MNIST that is.
We use sklearn.datasets to get the data and put it into a pandas dataframe like so.:

iris_data = datasets.load_iris() 
iris_df = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']],
                      columns=iris_data['feature_names'] + ['target'])

To make this example more explicit lets select for only 2 out of the given 3 species.:

iris_df["target"] ="int")
iris_df = iris_df[[0, 1])]

Now you should see 100 measurements (rows that are our n) of the Iris flower data.
We now take out 50% of the petal length entries by chance and assign NaN values with a bit of numpy and pandas magic.

random_vec = np_random(iris_df["petal length (cm)"].shape)<0.5 # 50% marked by numpy random vector on the column
iris_df["petal length (cm)"] = iris_df["petal length (cm)"].where(random_vec, other=np.nan)

Why do we care about all of our measurements?

When we plot our data, we can clearly see that the petal length and petal width can help us identify which species (target-column) we have (see Fig. 1).

Fig. 1: The relation of petal length and width toward the flower species aka. target-column. We can clearly identify two clusters.

We can make an even clearer distinction by regressing on the groups using just the vector which we have decimated earlier.

Fig. 2: The regression of the petal-length parameter on the identified species (target)

Too bad about half of the data is missing. So what do we do now?

Different Shades of Missing

First off, there are different degrees of missing data:

Missing Completely at Random (MCAR) and Missing at Random (MAR).

For a formal introduction to this see Bayesian Data Analysis [1] Ch. 18

One important distinction is that MCAR treats your missing parameters as independent of your observed measurements (lets call them y).
In our example we rely on our parameter being missing at random or MAR. That means we don’t think that the chance of a value gone missing is dependent on the parameter property BUT we think that there is potentially some relation between our length measure and the target species.

A Regression Model to Rule them all

First lets extract some basic info from the data of what we have so far:

>>> print("Petal Length μ = {}".format(iris_df["petal length (cm)"].mean()))
>>> Petal Length μ = 2.8072727272727267
>>> print("Petal Length σ^2 = {}".format(iris_df["petal length (cm)"].std()))
>>> Petal Length σ^2 = 1.4923079993698234

Now we build the model to do the imputation. For this we use the tool NumPyro, which is light-weight probabilistic numpy backend to Pyro – a heavy duty probabilistic language.
This comes in handy after formulating the model. We will use MCMC methods to fit the model and simulate values for missing entries.
To make this more applicable to a use case we combine this with a Bayesian regression-model. That means we want to model the distinction in species given all parameters that we know. Our parameters therefore are:

sepal length := s_len, sepal width := s_width, petal width := p_width and, lastly petal length := p_len, which includes our missing data. 

For each of those parameters we introduce a learnable bias which comes from a regular Normal distribution aka Gaussian for the sake of simplicity. So lets get started.:

def model(s_len, s_width, p_len, p_width, target=None):
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))

One can do a bit of tweaking for the bias parameters above, but for our initial runs the normal Gaussian worked reasonably well.
Now the delicate part. We introduce the distribution which we assume lies under our half-destroyed parameter. As with a lot of things we assume this to be Gaussian, which in our case is a reasonable assumption. Therefore we take the get loc=\mu (2.8 +/- 0.2) and scale=\sigma^2 (1.5+/-0.2) which we had computed earlier:

    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))

The informed reader clearly sees how the \mu and \sigma themselves are probabilistic parameters for themselves which we model in the process. For both we assume Gaussians but more rigorous distributions could be chosen.
Now we have to find our targets which we model. Hence we localize our missing positions in the vector with numpy.

    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]
    if target is not None:
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
        len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], 

Additionally, we tell NumPyro to make the imputation-value (len_impute) a learnable parameter and fill with as many zeroes as values we are missing.
On a short sidenote: our model has a predictive mode, that occurs when no target (aka species) is provided. When we invoke the predictive capabilities, we sample from the introduced distribution with the sampled \mu and sampled \sigma instead. Make sure to only apply this on the NaN positions in your vector! That is the meaning of the else clause:

numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], len_sigma[len_nan_idx])

We then put the imputed values into the right places and give it the inference sampling, with the observed length values to do its optimization. For this we use the JAX library with some efficient index updating – linear-algebra style.:

   p_len = ops.index_update(p_len, len_nan_idx, len_impute)
   numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)

In our length-sampling we include the actual length measures as the observations in the process .

Putting Things Together | Imputed Values into the Linear Model

For this model to be a proper linear model, we have to fit our target y (the species).
Since we have two classes, we do this by invoking a Bernoulli distribution.
We therefore ask in a Bernoulli process whether we are looking at y=0 or y=1 – two classes, two potential outcomes.

The Bayesian Regression – Code

We introduce two parameters, one of which is a bias term for the length parameter. We compose the

    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len
    logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width

Now to the magical part. We use the logit for the Bernoulli, which composes of all the parameters that we put into the model in interaction with their biases. The observed value in this process is now the target-parameter, our y.

   numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

The complete NumPyro imputation model

def model(s_len, s_width, p_len, p_width, target=None):
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))
    # impute length
    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]
    if target is not None:
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
        len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], 
    p_len = ops.index_update(p_len, len_nan_idx, len_impute)
    numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)
    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len
    logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
    if target is None:
        # prediction case
        probs = expit(logits)
        numpyro.sample("probs", dist.Delta(probs))

    numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

Run It

To fit the model we use Bayesian Inference utilizing a MCMC method; the NUTS sampler to be specific. The beauty of the NUTS sampler is that it is a hands-off approach, as it adjusts step sizes by itself, given its stepping through the problem-space. We let it have 5000 warm-up or burn-in runs, which are discarded in the process and let it run for 10000 iterations, just to be sure. Our inference finishes with an acceptance probability of 0.96 or 96% – neat!. We then have a look at the summary what has happened:

mcmc = MCMC(NUTS(model=model), 5000, 10000, num_chains=1), **iris_data)

Fig. 3: Model Summary after MCMC indicates convergence and solid R-hat values as metric.

The output clearly shows convergence of the parameters which we sampled in our inference. Specifically the R-hat values are all >= 1.0 .

We then pull the samples and visualize the posterior – we rely on JAX again for the formatting magic:

samples = mcmc.get_samples()
posterior_mu = jnp.expand_dims(samples['l'], -1) + \
               jnp.expand_dims(samples['b_length'], -1)*iris_df["petal length (cm)"].values + \
               jnp.expand_dims(samples['b_s_len'], -1)*iris_df["sepal length (cm)"].values + \
               jnp.expand_dims(samples['b_s_width'], -1)*iris_df["sepal width (cm)"].values + \
               jnp.expand_dims(samples['b_p_width'], -1)*iris_df["petal width (cm)"].values
Fig. 4: Final Bayesian regression over MCMC samples with 90% CI.

There we have it. A clear distinction between our two classes. To be fair there is quite a bit of jitter going on above and below the bounds, but the regression-line could not be nicer.

We have shown that we can fill in the blanks in our model through MCMC-sampling.

For the code-savvy curious reader the final code can be found on my github and in Colab.

What if there are more than two classes, What about Categorical Data?

Great questions, which can be answered with NumPyro’s toolkit as well:

  1. Instead of having a Bernoulli trial in the model you can also use a Multinomial for your Bayesian inference.
  2. We can also use categorical data for model fitting. That looks quite different. Instead of using the one-dimensional array we rely on as many dimensions as there are categories.

A good example to fit with categorical data instead of continuous one is in the excellent kaggle notebook on Age Imputation given other parameters for the Titanic dataset:

The Dos and Don’ts

Please stick to what you have!

Do not invent data. I cannot stress that enough. It is a very fine line that we are walking here.
If your experiments have generated insufficient data or you had to throw some data out, then imputing values, might not be the best way to go. Even when your data might become presentable that way. When talking to a client or an academic journal you don’t get points for: we simulated parameter X and Y to 85.3% – no matter how nice the model looks.
Please do declare what methods, distributions and, assumptions you used to make the models and imputations for maximum transparency.
One recommendation is that you can always take out the data, fit a model without missing measurements and compare it to a model with imputed values in the end. That can make for a compelling story.


We have walked through data imputation, from the data-frame with holes in it to the final Bayesian inference and fitted posterior distribution.
We have shown how your Bayesian regression can benefit from an imputed vector and what the output looks like including confidence intervals.
Probabilistic Programming can be a powerful tool in your daily work.

Best of luck with your models and analysis!


[1] A. Gelman, J.B. Carlin, et. al., Bayesian Data Analysis . Third Edition.
[2] R. McElreath. Statistical Rethinking. 2016. CRC Press.
[3] Uber Technologies. Regression using NumPyro. 2019. NumPyro Documentations.
[4] M. Betancourt. A Conceptual Introduction to Hamiltonian Monte Carlo. 2020 on ArXiv.
[5] CS 4110 – Programming Languages and Logics. Probabilistic Programming. 2016. Cornell University.

Single-Parameter Models | Pyro vs. STAN

Photo by Joey Csunyo on Unsplash

Modeling U.S. cancer-death rates with two Bayesian approaches: MCMC in STAN and SVI in Pyro.

Single parameter models are an excellent way to get started with the topic of probabilistic modeling. These models comprise of one parameter that influences our observation and which we can infer from the given data. In this article we look at the performance and compare two well established frameworks – the Statistical Language STAN and the Pyro Probabilistic Programming Language.

Kidney Cancer Data

One old and established dataset is the cases of kidney cancer in the U.S. from 1980-1989, which is available here (see [1]). Given are U.S. counties, their total population and the cases of reported cancer-deaths.
Our task is to infer the rate of death from the given data in a Bayesian way.
An elaborate walk-through of the task can be found in section 2.8 of “Bayesian Data Analysis 3” [1].
Our dataframe looks like this:

We have the total number of datapoints N (all rows in the dataframe) and per county we have the observed death-count (dc) which we refer to as y and the population (pop), which we call n later.
Given that we have epidemiological data, we think that the Poisson distribution gives a good basis for estimating the rate that we want to compute. Therefore our observations y are sampled from a Poisson distribution. The interesting parameter is the \lambda for our Poisson, which we call rate. This death-rate comes from a Gamma distribution (see explanations in Chapter 2 [1]). A very short explanation why we use a Gamma here is that it is a conjugate prior with respect to Poisson.
Though there is more to the special relationship that the distributions have.

Putting it together we arrive at the formulation for the observations of death-cases:

and for the wanted single parameter rate:

Now that we have seen the tasks lets fire up our tools and get to work.

STAN – the classical statistical models

The statistician’s well established work-horse is the stats language STAN.
You write your problem as model-code and STAN will compile an efficient C++ model under the hood. From a compiled model you can sample and perform inference.
The Framework offers different interfaces for example in Python (PyStan), R (RStan), Julia and others. For this article we will use PyStan so that we can have both models neatly side by side in a notebook.
We start to define the data that we have given in STAN. This means the integer N for the total size of the data set, y the observed deaths and n the population per county. So that we have each N times.

data {
   int N; // total observations
   int y[N]; // observed death-count
   vector[N] n; // population

We already know that a rate parameter like the death-rate \theta is bounded in [0, 1]. That means you cannot have a rate less than 0 or higher than 1.
This \theta parameter is then basis for the actual rate – that accounts for a counties population. Thus, since we transform our parameter we put it into the transformed parameters bracket.:

parameters {
   vector<lower=0,upper=1>[N] theta; // bounded deathrate estimate

transformed parameters {
   vector[N] rate=n .* theta;

Now to the magical inference part, which is what the posterior is all about. We sample our \theta from a Gamma distribution. The \alpha (=20) and \beta (=430000) are given in [1], but one could easily compute them from the underlying dataset. Notice, that the model actually takes the transformed parameter and not just the \theta. This will become apparent in the output later.

model {
   theta ~ gamma(20,430000);
   y ~ poisson(rate);

Now that we have a STAN Model as a string, we need the data in the right format. That means arrays of integers from the data, like so:

dc = data["dc"].to_numpy().astype(int)
pop = data['pop'].to_numpy().astype(int)
kidney_cancer_dat = {'N': N,
                    'y': dc[0:N],
                    'n': pop[0:N]}

Since everything is set-up we can use PyStan to convert our Model to C++ code and use sampling to perform inference. The inference procedure is Markov Chain Monte Carlo or MCMC. The necessary parameters are the amount of samples that we take (the iterations) and the amount of chains over which we draw the samples. Each chain is an ordered sequence of draws.

sm = pystan.StanModel(model_code=kidney_cancer_code)
fit = sm.sampling(data=kidney_cancer_dat, iter=5000, chains=4)

After the fitting is done, the first thing to check is if the model has converged. That means that \hat{R} is >= 1 or we can just ask the diagnostics tools:


### the below lines are output:
n_eff / iter looks reasonable for all parameters
Rhat looks reasonable for all parameters
0.0 of 10000 iterations ended with a divergence (0.0%)
0 of 10000 iterations saturated the maximum tree depth of 10 (0.0%)
E-BFMI indicated no pathological behavior

We can also see that the chains have converged (right plots) from the traces of the performed inference:

Fig. 1 – Traceplot of the \theta posterior computation. The chains on the right side display convergence. Each chain has a density function for theta (left side), which display agreement over the values.

The chains (Fig. 1 right) should look like “caterpillars madly in love” and your diagnostics look good. So we know that our model has converged. We can now see the inferred theta values and can look at the individual densities:

az.plot_density(fit, var_names=["theta"])
Fig. 2 – Posterior densities on the fitted single parameter \theta after inference.
Warning: Do not use the complete dataset as input! This will lead to errors,
we went with N=25 samples for testing purposes. Feel free to test how many samples it takes to break PyStan (or STAN).

For doing Bayesian regression task there are also different modules that let you skip the “write-an-elaborate-statistical-model” part. BRMS in R is one of those excellent tools.

Pyro – the probabilistic programming approach

Going to a completely different paradigm is Pyro. Instead of performing MCMC through sampling we treat our task as an optimization problem.
For this we formulate a model, which computes our posterior (p). Additionally we also formulate a so-called guide which gives us a parameterized distribution (q), which is used for the model-fitting procedure.
In a previous SVI article we had already looked at how this optimization works and recommend a short recap.

The highly unintuitive part to this is that instead of simply fitting one parameter we comprise it of four Pyro parameters.:

  1. Firstly, \alpha and \beta are constants that give our Gamma distribution its shape like we did in the STAN part,
  2. We add two boring trainable parameters p1 and p2, which get allow for optimization in the SVI steps. Both parameters are positive – hence constraint=constraints.positive ,
  3. When going through the data, the observations themselves are independent events.
    This is done through Pyro’s plate modeling, which also supports subsampling from our dataset. A nice introduction to this can be found in [2].
ϵ = 10e-3
def model(population, deathcount):
    α = pyro.param('α', torch.tensor(20.))
    β = pyro.param('β', torch.tensor(430000.))
    p1= pyro.param('p1', torch.ones(data.shape[0]), constraint=constraints.positive)
    p2 = pyro.param('p2', torch.ones(data.shape[0]), constraint=constraints.positive)
    with pyro.plate('data', data.shape[0], subsample_size=32) as idx:
        n_j = population[idx]
        y_j = deathcount[idx]
        α_j = α + p1[idx]*y_j
        β_j = β + p2[idx]*n_j
        θ_j = pyro.sample("θ", Gamma(α_j, β_j))
        λ_j = 10*n_j*θ_j + ϵ
        pyro.sample('obs', Poisson(λ_j), obs=y_j)

The model performs the sampling of the observations given the Poisson distribution. This is comparable to what STAN was doing, with the difference that the parameters which make up \lambda and the underlying distributions are now trainable Pyro parameters.
In order for our model to not go up in flames we have to add a small number \epsilon to the rate otherwise Poisson will be unstable:

λ_j = 10*n_j*θ_j + ϵ
This is not the best way to model this task, but it is closest to the STAN model. One can find a model and guide that do not rely on y and n for computing alpha and \beta.

Now the guide gives us a parameterized distribution q with which we can perform optimization on minimizing the Evidence Lower bound or ELBO. The parameters are trainable Pyro parameters, which we have already seen in the model.

def guide(population, deathcount):
    α = pyro.param('α', torch.tensor(20.))
    β = pyro.param('β', torch.tensor(430000.))
    p1 = pyro.param('p1', torch.ones(data.shape[0]), constraint=constraints.positive)
    p2 = pyro.param('p2', torch.ones(data.shape[0]), constraint=constraints.positive)
    with pyro.plate('data', data.shape[0], subsample_size=32) as idx:
        n_j = population[idx]
        y_j = deathcount[idx]
        α_j = α + p1[idx]*y_j
        β_j = β + p2[idx]*n_j
        θ_j = pyro.sample("θ", Gamma(α_j, β_j))

Now we can run our SVI like so:

svi = SVI(model, guide, Adam({'lr': 0.025}), JitTrace_ELBO(3))
for i in tqdm(range(1000)):
    loss = svi.step(population, deathcount)
    print(f"Loss: {loss}")

When we plot the loss, we can see that the model improves over time. The lower the bound the better. The loss is the ELBO trace over the iterations.

Fig. 3 – SVI loss over iteration. Loss is the ELBO returned from the SVI step over 1000 iterations.

Now for the final check we can see that we have fitted appropriate parameters.
The \alpha and \beta variables make up for a good underlying Gamma distribution for our posterior inference.

Conclusion – What approach is best?

STAN has a straight-forward way to reason about the model. If the mathematical formulation of a posterior is known, it can be very straight forward to implement a model.
However STAN and its MCMC sampling have their limitations. Under default configuration it was not possible to run our model over all of the data.
Pyro does an excellent job at handling larger datasets efficiently and performing Variational Inference. As a Probabilistic Programming Language it can be written just like any other Python code. The model plus guide are also informative and encapsulate the problem well. Through this we transform the posterior computation into an optimization task and get sensible outputs.

Now that you are familiar with single-parameter models have fun working with bigger, more complex tasks. The Pyro Examples are a good place to start.

Happy inferring!


[1] A. Gelman, J.B. Carlin, et. al., Bayesian Data Analysis . Third Edition.
[2] Pyro Documentation – SVI part II

Your 10.000 hours in Data Science | A Walk down Data-Lane

The way from data novice to professional.

Clock with reverse numeral, Jewish Townhall Clock, Prague- by Richard Michael (all rights reserved).

There exists the idea that practicing something for over 10000 h (ten-thousand-hours) lets you acquire enough proficiency with the subject. The concept is based on the book Outliers by M. Gladwell. The mentioned 10k hours are how much time you spend practicing or studying a subject until you have a firm grasp and can be called proficient. Though this amount of hours is somewhat arbitrary, we will take a look on how those many hours can be spent to gain proficiency in the field of Data Science.
Imagine this as a learning budget in your Data-apprenticeship journey. If I were to start from scratch, this is how I would spend those 10 thousand hours to become a proficient Data Scientist.

The Core modules

Mathematics and Statistics: Basic (frequentist) statistics, Bayesian statistics, Intro to Data Analysis, some Probability Theory, some basic Analysis and Calculus.
A Data Scientists main job is to provide insights into data from problem domains. To have a firm grasp of the underlying mathematics is essential. You can find a lot of excellent content from available university courses and Coursera courses. This is a good way to get started on your journey and get a rudimentary understanding of how statistics work. Though, fundamental courses are essential, please challenge yourself with the real deal.
Lets give that 2500h (312 days of 8h work).
Analysis/Stats Languages and modules: R, Python (pandas), Julia, SAS.
This is the bread-and-butter for you and can be considered a part of your stats-education. Here it is about learning by doing. Reading a O’Reilly book on R will only get you so far. Load up a curated data-set, some kaggle challenge and work through some problems.
There is a spectrum on which statistical languages lie. So if you’re close to scientific computing you might consider Julia. The other end of the spectrum with just statistical analysis is SAS. Some people argue that R can do both.
1000h (125 days of 8h work)

Multi-purpose programming languages: Python, Go, Java, Bash, Perl, C++,… .
This very much depends on the systems that you are facing on a daily basis. If you just start out, pick a language and stick to it. Once you learn the concept you will pick up different languages easier.
I for myself rely heavily on a combination of Python and Bash for my daily work. Other tasks require a thorough understanding of the good old Java or even Perl to get started.
2000h (250 days of 8h work).

Database Technologies: {T-, My-, U-}SQL, PostgreSQL, MongoDB, … .
Relational or non-relational databases are some of the systems that you will have to work with in a production environment. It is useful to know, how your data is stored, how queries run under the hood, how to reverse a transaction. For later work your data-sources might vary a lot and it is good to have a basic understanding.
750h (94 days of 8h work)

Operating Systems: Windows, Linux, MacOS.
Whatever your work-environment is: master it! Know the ins-and outs. You might need to install some really weird library this one time to solve a specific problem (true story). Knowing where things are and why they are there goes a long way. Know what an SSH connection is and how to run analysis on a different system. Running analysis not on your local machine is going to happen at some point in the future.
500h (62 days of 8h work) .

ETL/ELT Systems: This is a mixture between programming languages, database technologies and operating systems. Frameworks like Spark, Hadoop, Hive offer a advanced means for data-storage and also analysis on cluster computing platforms. Some companies may rely on a different tech-stacks like Microsoft Azure Systems, Google Cloud or AWS solutions.
This goes hand in hand with Database Technologies and might already require a firm understanding of higher level programming languages (like Java or Python). There are also beginner systems, like KNIME to get your feet wet .
400 (50 days of 8h)

Your Problem Cases: This may be Fin-Tech, Bio-Tech, Business Analytics
You should be familiar with the problem domain in which you are working. Is it research and development that you are doing? Are you visualizing business processes and customer behavior? Different field require different insights. Your data has to make sense to you and you should be able to see if a model-output is in a valid range or totally off. Spend a very absolute minimum of 350h in your problem domain. We suggest more. A lot more.
You can decide if you want to be the method jack-of-all-trades or the expert in your field.
> 350h (44 days of 8h)

The attentive reader sees that this only adds up to 7500 hours so far. We have a basis now and you might want to go into a certain direction from here.

Different Data-trades – Your personal direction

Back in the olden days, the dark-unenlightened ages, we had guilds. The field of working with data also has different guilds. Types of tradesmen that solve different problems and there have been different articles on the subject.
Here is how those trades differ in their requirements:

  1. The Data Scientist: You do stats and analysis, you roll-out solutions and deploy platforms that answer questions and provide insights. Neat. Up your statistics game: + 200h. Maybe a bit of ML which we call statistical learning +100h and some type of multi-purpose-language like Python, Perl or Bash for your dayjob.
  2. The Data Engineer: You care about the data. You build the systems that enable the Data Scientist to work effectively and give the analysts their daily bread. Your focus lies on the underlying systems. +500h on Systems, +200h on ETL/ELT systems, +200h on DBs and programming languages +200h. You can even afford to not be so heavily involved in the statistics part (-500h on pure stats and theory).
  3. The Machine Learning Engineer: You implement the models that make the magic happen.
    You need all the statistical insights and also proficiency with all the ML on top of things –
    +300h on ML, + 200h on Stats. You should also be well versed in a high level programming language which makes implementing ML-models possible in the first place.
  4. The Analyst: You take data from the systems, make it pretty and reportable for management. You need to know what matters in your problem domain +200h in problem cases, +100h in DB systems. You need your SQL daily. Though there is a big variety of analyst jobs out there. So inform yourself and continue learning what you see is necessary.
Fig. 1 – Time and experience from a Data Scientist compared to an analyst’s job..
Fig. 2 – Time and experience of an ML-Engineer compared to a Data-Engineer.
Math/Stats = Mathematics and Statistics, includes your work with stat-languages like R.
Systems = Setting up and maintaining systems (e.g. ETL/ELT)
DBs = Database Systems (relational, non-relational)
SWE = Software Engineering (Python, Perl, Java, Scala, Go, JavaScript, …)
Exp. Field = Experience in the problem domain. Hours spent on the topic.

The Figures above illustrate exemplary careers and how much time and effort can be placed in each domain. The figures make suggestions and vary from individual to individual. For example the Data-Engineer in our example could also be called a Software Engineer that handles a lot of data and another engineer would need more experience working with databases compared to programming languages.

As with everything in life the lines are blurred. Sometimes a Data Scientist is also does the work of a ML Engineer and vice versa. Sometimes a Data Scientist also has to do Analytics Dashboards. Stranger things have happened in the past. Keep in mind that the 10k budget is your entry and gives you a direction into the field and not your specialization. So you have all the freedom in the world to eventually do that PhD in Parapsychology that you’ve always wanted.

How much work is ten thousand hours

10k hours is a long time. if you were to work for one year that would be 27h per day and not so feasible for the average mortal.
So better 3 years of work for around 9 h might be better suited.
If you’re studying on the side and look for a change in your job at some point in the future, then you might be set after a couple more years. The important part is to be consistent with it. A couple of quality hours a day over a few years will get you a long way.

Personal Note

My journey started in October 2016 coming from systems administration and starting a new undergraduate program. Now that it is July 2020 it has been around 1360 days of gaining experience in the field. Therefore I am in the field of >7500 h (with over 5.5h every day) and still continue to study and learn a lot every day. My undergraduate program in Cognitive Science helped me get a head-start in the fields of computer science, applied mathematics, Bayesian statistics, DB systems, formal logic, Machine Learning courses and much more. This together with working as a Data Engineer part-time helped me gain a lot of insights really quickly.

The Key element

I don’t care if your do your 10 thousand hours at Uni, in school, at seminars or at home.
Nobody cares. You have to show your work. A degree, if done right, can show this – but it doesn’t have to. It is about gaining experience in the field by solving problems and gaining proficiency with concepts.
Solve a problem and put it in a portfolio, your resume or simply your github-repository.
Whatever works best for you. People hire for your problem solving skills. Not for the A that you got on that one project 2 years ago. You might say that it was a really cool project altogether.
That’s great! if you made a dent in the universe you should tell people about it anyways.

The journey is a long one and if you enjoy what you do it is worthwhile. You will learn a lot along the way.

A short Disclaimer: Always keep in mind that there are different fields of Data Science related work out in the wild.
Every job has its specific purpose and requires a different tool-sets.
Every employer might look for different sets of skills.

Compute the Incomputable | What SVI and ELBO do

One reason why Bayesian Modeling works with real world data. The approximate light-house in the sea of randomness.

Photo by William Bout on Unsplash

When you want to gain more insights into your data you rely on programming frameworks that allow you to interact with probabilities. All you have in the beginning is a collection of data-points. It is just a glimpse into the underlying distribution from which your data comes. However, you do not just want to get simple data-points out in the end. What you want is elaborate, talkative density distributions with which you can perform tests. For this you use probabilistic frameworks like TensorFlow Probability, Pyro or STAN to compute posteriors of probabilities.
As we will see, the plain computation is not always feasible and we rely on Markov Chain Monte Carlo (MCMC) methods or Stochastic Variational Inference (SVI) to solve those problems. Especially over large data-sets or even every-day medium sized datasets we have to perform the magic of sampling and inference to compute values and fit models. If these methods would not exist we would be stuck with a neat model that we had thought up, but no way to know if it actually makes sense.
We have built such a model in a previous article on forecasting .
In this example we will look at inference and sampling from a Pyro point of view.

In short, instead of brute-forcing the computation Pyro makes an estimate. We can sample from that estimate and we can then compare that estimate to the actual data given our model. Then we make improvements – step by step. But why can’t we just compute the whole thing? I’m glad you asked.

From infinity to constant

Lets say we have a dataset D. We want to build a model (\theta) that accurately describes our data. For that we need a set of variables. These variables correspond to events that are random and we call them latent random variables e.g. z.
Now to see how well our model does we rely on posterior probability distribution P_{\theta}. If this is close to the real deal the likelihood of this posterior, compared to the observed data will be low. That is a good starting point, now how do I compute this P? In order for this to work, we have to do inference like this:


The math-savvy reader already sees that the right part of the equation above is in general not computable. We face the terror of the almighty integral over something we don’t know yet and given a continuous variable might be infinite. That is quite big.
What we can do instead is use a variational distribution; lets call it Q.
In an ideal world this approximate Q would be as close as possible to our P which we are trying to compute. So how do we get there?
One thing that we need first is a measure of how similar P and Q are. Enter the Kullback Leibler Divergence — you can find the legendary paper by Kullback and Leibler in the Annals of Mathematical Statistics [1].
Our tool to compare the distributions, the Kullback Leibler Divergence is
defined as :

    \begin{align*}KL(Q(z|x)||P(z|x)) = E_q\log\frac{Q(z|x)}{P(z|x)}\end{align}

This measure tells us how similar the distributions P and Q are. If they are similar, the KL is low and if they are very different, the divergence is large.

Now this measure requires us to know both distributions P and Q. However we don’t know P yet. After all, that is what we are computing and so far our Q is a wild guess.
After some reformulation we end up with this (see [5] for a step-by-step solution):

    \begin{align*}KL(Q(z|x)||P(z|x))=-E_q[\log P(z,x)-\log Q(z|x)] \\ + \log P(x)\end{align}

The first part: -E_q[\log P(z,x)-\log Q(z|x)] is what is called the evidence lower bound or ELBO. The second part: \log P(x) is a constant. That is much better. All we have to do now is maximize the ELBO to get minimal divergence between our P and Q.

Minimizing KL divergece corresponds to maximizing ELBO and vice versa.

Learn Q with Deep-Learning.

Now that we have the basic theory laid out, what do we do about our Q?
We can learn this estimate by using some form of neural network. We already have TensorFlow or PyTorch in the case of Pyro at our disposal. Compared to the model we can build something that models our variational distribution. As soon as we have that we can use stochastic gradient descent to optimize and get our model. So lets fire away.

We use our data as input and get Q as an output from the network. If we have this as a Gaussian then the network gives us a mean \mu and standard deviation \sigma.

If we picture this process in a very simple form, rolling out over time it looks something like this:

Fig. 1 – Function of model and guide when computing the posterior.

Over all our data we then observe various random events which we compare to what we have sampled at that point in time. We try to minimize the divergence – which means to maximize the evidence lower bound. At the last point in time the ELBO is maximized or the divergence is minimal. We then have fitted our model.

Talk Pyro to me

Lets take a simple example case in which we use a fit a model-function f, given the latent random variable z which comes from a \beta-distribution. The events we are observing are happening Bernoulli-random. This means that the event either occurs or it doesn’t. Lastly the events are i.i.d. – that is independent and identically distributed which we write as the `pyro.plate` notation. So that our model is:

 def model(self):
        # sample `z` from the beta prior
        f = pyro.sample("z", dist.Beta(10, 10))
        # plate notion for conditionally independent events given f
        with pyro.plate("data"):
            # observe all datapoints using the bernoulli distribution
            pyro.sample("obs", dist.Bernoulli(f),

So far this is standard Pyro and in our forecasting model we did not specify a guide even though it was still there. You can take full control of your model process, by specifying the guide as such. In most cases this just looks like your model only slightly different. In this guide the latent z comes from this:

 def guide(self):
        # register the two variational parameters with pyro
        alpha_q = pyro.param("alpha_q", torch.tensor(15),
        beta_q = pyro.param("beta_q", torch.tensor(15),
        pyro.sample("latent_z", NonreparameterizedBeta(alpha_q, beta_q))

Lastly we run our inference by using the SVI object. We provide the previously specified model and the specified guide. Also we rely on the standard Adam optimizer for stochastic gradient descent.

def inference(self):
        # clear everything that might still be around
        # setup the optimizer and the inference algorithm
        optimizer = optim.Adam({"lr": .0005, "betas": (0.93, 0.999)})
        svi = SVI(self.model,, optimizer, loss=TraceGraph_ELBO())

You’re all set. Now lets look at some output. For this to take some shape I have used a simple Bayesian regression task and compute the posterior for the model-function (see my github for the specifics).
We utilize Pyro’s SVI but also a Markov Chain Monte Carlo procedure with a NUTS sampler.

The Ugly Truth | Comparing SVI to MCMC

So far everything sounded very promising – almost too good to be true. With everything in life you should always check your results. This example below (Fig. 2-3) shows the fitted posterior distribution of the previously mentioned regression-task using SVI and MCMC side by side.

Fig. 2 – Bayesian Regression of a function f with two variables using SVI and MCMC.
Fig. 3 – Density distributions overlay of a Bayesian regression task with posterior distribution after model-fitting.

The in-depth understanding of how and why Markov methods and inference differ is topic of various research. As this is a deep-dive it will be the topic for another article. The only thing that we can conclude at this point is that the sampling routine has an evident impact on our final model.

What do I do with SVI

We have looked a seemingly unfeasible problem in the face and taken a look at the theory of the underlying the inference process to figure out why this works. You can use deep-learning frameworks like TF and PyTorch to build your probabilistic models by approximating another model-distribution and use MCMC or SVI for this. The goal of it all is to minimize divergence (or maximize the lower bound) between our two approximated posteriors.
We should always look at our output and ask ourselves if what we see makes sense. The sampling procedure apparently influences our posterior outcome – so we should be careful.
All in all SVI allows us to take any data-set and perform Bayesian Modeling with it. If you ask me, that is a pretty nice feature.


  1. S. Kullback and R.Leibler. On Information and Sufficiency. 1951 in Annals of Mathematical Statistics.
  2. D. Wingate and T. Weber. Automated Variational Inference in Probabilistic Programming. 2013 on arxiv.
  3. Pyro Documentation – SVI tutorial I
  4. Pyro Documentation – SVI tutorial II
  5. Pyro Documentation – SVI tutorial III
  6. Reza Babanezhad. Stochastic Variational Inference. UBC
  7. D. Madras Tutorial Stochastic Variational Inference. University of Toronto. 2017

The Complete Pyro Example

The metioned example simply serves as a rough overview of how a Pyro class is set up. The functional code can be found in the Pyro documentation (see [5]).

class BetaExample:
    def __init__(self, data):
        # ... specify your model needs here ...
        # the dataset needs to be a torch vector = data

    def model(self):
        # sample `z` from the beta prior
        f = pyro.sample("z", dist.Beta(10, 10))
        # plate notion for conditionally independent events given f
        with pyro.plate("data"):
            # observe all datapoints using the bernoulli likelihood
            pyro.sample("obs", dist.Bernoulli(f),

    def guide(self):
        # register the two variational parameters with pyro
        alpha_q = pyro.param("alpha_q", torch.tensor(15),
        beta_q = pyro.param("beta_q", torch.tensor(15),
        pyro.sample("latent_z", NonreparameterizedBeta(alpha_q, beta_q))

    def inference(self):
        # clear the param store
        # setup the optimizer and the inference algorithm
        optimizer = optim.Adam({"lr": .0005, "betas": (0.93, 0.999)})
        svi = SVI(self.model,, optimizer, loss=TraceGraph_ELBO())

How your model is optimized | Know your Optimization

The answer to: “Why is my model running forever?” or the classic: “I think it might have converged?”

The driver behind a lot of models that the average Data Scientist or ML-engineer uses daily, relies on numerical optimization methods. Studying the optimization and performance on different functions helps to gain a better understanding of how the process works.
The challenge we face on a daily basis is that someone gives us a model of how they think the world or their problem works. Now, you as a Data Scientist have to find the optimal solution to the problem. For example you look at an energy-function and want to find the absolute, global minimum for your tool to work or your protein to be stable. Maybe you have modeled user-data and you want to find the ideal customer given all your input-features — hopefully continuous ones.
Here is a comprehensive summary of how an optimizer works under the hood:

Benchmark Functions – the curved, the bent and the asymmetric

Before we meet our optimizers we have to get acquainted with the functions that we intend to optimize. My three personal favorites are:

  1. Mr. Nice-Guy is the ellipsoid function. It is overall nicely behaved, convex and has a global minimum.
    it is defined as f_1(x)=\sum_{i=1}^d\alpha\frac{i-1}{d-1}(x_i)^2
  2. The Rosenbrock or also banana-function. It is not as nice as our f_1 – through its valley of-death it can be a challenge to optimization algorithms. It is defined as f_2(x)=(1-x_1)^2+100*(x_2-x_1^2)^2.
  3. The twisted-one is our attractive-sector function. It is the most unusual of the selection as it is non-convex and asymmetric around the minimum. It is defined as f_3(x)=\sum_{i=1}^d h(x_i)^2+100(-x_i)^2 where h(x)=\frac{\log(1+\exp(qx))}{q}.

This is comprehensible selection of functions and there exists a multitude of available benchmark-functions to run your optimizer across. From the depths of the Styblinski-Tang function to the beautiful peeks of Rastrigin. Some of which are ready made and implemented like in [1].

As you might have noticed the functions work on R^n and we provide an input vector \textbf{x}\in R^n. We will only be looking at two dimensions R^2 and three dimensions R^3 to make the display easier.

Now that we have set-up a playground we can just load up our tool of choice like scipy with its optimization library. Our goal is to capture how the optimizer finds the minimum on the benchmark function given different starting points. We just pick and choose any optimizers like simple gradient descent “CG” or the more intricate BFGS. It goes like this:

from scipy.optimize import minimize

# import benchmark function with its derivatives
from scipy.optimize import rosen
from scipy.optimize import rosen_der
from scipy.optimize import rosen_hess

import numpy as np

def banana(x, y):
a nice 2D way to look at Rosenbrock 
    return (1 - x)**2 + (100 * (y - (x**2))**2)

# we select some interesting optimizers:
OPTIMIZERS = ['CG', 'BFGS', 'Newton-CG', 'dogleg']
FUNCTIONS = {'rosen': [rosen, rosen_der, rosen_hess, banana]
# feel free to add or implement ellipsoid or other benchmark functions here with their derivatives

start_values = np.random.uniform(low=x_min, high=x_max, size=(10, 2))

start_iter = 1
max_iter = 50
step = 1
x_min, x_max = -5, 5
y_min, y_max = -5, 5

def optimize_funct(fname, funct, x0, derivative, hess, optimizer, iteration):
Just a wrapper around the scipy minimize function
  fitted_min =  minimize(funct, x0=x0, method=optimizer, 
                         jac=derivative, hess=hess,
  return fitted_min

# run actual optimization
fitted_optimizers = {}
for opt in OPTIMIZERS:
  fitted_optimizers[opt] = {}
  for name, functs in FUNCTIONS.items():
    fitted_optimizers[opt][name] = []
    f = functs[0]
    fd = functs[1]
    fdd = functs[2]
    f_2d = functs[3]
    for vals in start_values:
      computed_values = []
      x, y = vals
      z = f_2d(x,y)
      # include start values before optimization
      for i in range(start_iter, max_iter, step):
        out = optimize_funct(fname=name, funct=f, x0=vals, derivative=fd, hess=fdd,
                            optimizer=opt, iteration=i)
        # only save the output values (stored under x)
        x, y = out.x
        z = f_2d(x, y)
        computed_values.append(np.array([x, y, z]))

Not all optimizers are created equal. Some require the first and some the second derivative for our optimization functions aka the Jakobian and the Hessian . For an extra brief recap of what those are and how they work you can look here.
One thing to keep in mind is, that we start at different starting positions and let the optimizer roam freely – we are not doing constraint optimization just yet.

Fig. 2 – Contour plot of the Ellipsoid function from different starting points using standard gradient descend. Each step is a “+”. The optimizer finds the global minimum after a few steps.
Fig. 3 – Contour plot of the Rosenbrock- (Banana-) function. Gradient descent first walks into the right direction (see black “+”), but it looks for the global minimum to no avail. The valley it traverses is already low in function-value and information on where to look next is not as informative.
Fig 4. Attr. Sector contour plot of the function. BFGS optimization on the function space. Given three starting points (three colors) each step is depicted by a black cross. The minimum is quickly discovered and we see how a linear method (CG) behaves differently compared to BFGS which is from the family of quasi-newton methods.

Personally it helped me a lot to visualize how the optimization procedures move through the different function-spaces. Here you can see how CG and BFGS perform. Now CG is scipy’s implementation of gradient descent . It is the plain and simple optimization procedure and a classical line-search method. BFGS on the other hand uses approximated higher order information. BFGS therefore classifies as a quasi-Newton method.

A quasi-Newton method approximates higher order information like the Hessian Matrix (second order derivatives) and makes improvements per each step taken. [2]

Line-Searching | Backtrack Everything.

One method of optimization is simply following a line given the underlying function (see Fig. 2 and Fig. 3). We keep some information through backtracking, but that is about it. In order to optimize we may utilize first derivative information of the function. An intuitive formulation of line search optimization with backtracking is:

  1. Compute gradient at your point
  2. Compute the step based on your gradient and step-size
  3. Take a step in the optimizing direction
  4. Adjust the stepsize by a previously defined factor e.g. \alpha
  5. repeat until minimum is found or the difference in value between your steps is very (very) small – e.g. 10^{-8}
Gradients give us information on how the slope of a function behaves. We take the gradient as the first-order-derivative or partial derivative of a function. It is depicted as a vector and defined as \nabla f(x)=[\frac{\delta f}{\delta x_1}\ldots\frac{\delta f}{\delta x_n}]^T. 

Trust-Region Optimizers | Thinking in Circles

Instead of just looking straight ahead and straight back we can also look around us in a 360 degree fashion. We look at our function-information like the Jacobian and Hessian and make an optimal step in the right direction. In order for this to work we consider the quadrativ model function m_k(p)=f_k+g_k^Tp+\frac{1}{2}(p^TBp) where g is the gradient of f and B the Hessian.
When we do the optimization we optimize for p and adjust the radius that is around us appropriately. This looks like:

Fig. 5 Trust-Region Optimization on the contour of the attr.-sector function. Each step is depicted as “+” and around each step is the region from which the next step is chosen as a circle.

An intuitive formulation of the procedure can look like this
(see the Appendix for a formal representation):

  1. Compute Hessian and Gradient of your function
  2. Choose a point in a radius around your position that is optimal
  3. Take a step given the computed information
  4. Adjust the size of the radius given on how much or how little you have improved
  5. Repeat until convergence or tolerance is reached

Of course these short steps don’t quite capture the complexity, you need more parameters and a bit more math to find the right point in your radius, adjust the step and the radius if that is interesting to you see Alg. 1 and 2 in the Appendix.

We can observe an algorithm in practice and look at how the radius changes over time:

Fig. 6 – magnitude changes of function values (y axis) over steps taken (x axis). The radius \Delta changes with the improvement of the function-values.

We see that if the function values improve over time then our radius gets smaller (see Fig. 5 and 6) whereas if the values are bad initially we increase our radius. See the first steps of the optimization on the attr. sector function in Fig. 6.

Conclusion | The Rabbit Hole

We have seen how two families of optimizers work: the straight-lined linear optimizer and the trust-region optimizer. This is only the entrance of the rabbit hole. With each function and problem we look at we can discover something new. We can analyze how our Jacobian or Hessian matrix have to behave to make our optimizer do its work. We have to look at the shape and properties of our functions.
Obviously there exists a vast number of available proofs and theorems that are the mathematical backbone of what we have just looked at.
If you want to dive deep and get your hands on the raw mathematics I recommend going through the book Numerical Optimization (2nd ed.) by J. Nocedal and S. J. Wright [3].


Much of the work that I have shown here I had done in attending the Numerical Optimization course at KU – DIKU. Without the guidance of the course-officials this would not have been possible.


[1] MathWorks Test functions for global optimization algorithms
[2] C. Geiger and C. Kanzow . Quasi-Newton-Verfahren Springer 1999
[3] J. Nocedal and S. J. Wright Numerical Optimization (2nd ed.) Springer 2006


Trust Region Algorithm

Algorithm for the region of trust optimization in Fig. 5
Alg. to solve for p – solving for lambda is another subproblem.

Belief Networks

or Bayesian Belief Networks combine probability and graph theory to represent probabilistic models in a structured way.

Read: 12 min

  • translate graphical models to do inference
  • read and interpret graphical models
  • identify dependencies and causal links
  • work through an example


Belief Networks (BN) combine probabilities and their graphical representation to show causal relationships and assumptions about the parameters – e.g. independence between parameters. This is done through nodes and directed links (vertices) between the nodes to form a Directed Acyclic Graph (DAG). This also allows to display operations like conditioning on parameters or marginalizing over parameters.

A DAG is a graph with directed vertices
that does not contain any cycles. 
Fig 1. A basic representation of a Belief Network as a graph.

Use and Properties

In Fig.1 we can see a model with four parameters lets say rain (R), Jake (J), Thomas (T) and sprinkler (S). We observe that R has an effect on our parameters T and J. We can make a distinction between cause and effect – e.g. the rain is causal to Jake being wet.
Further we can model or observe constraints that our model holds, such that we do not have to account for all combination of possible parameters (2^N) and can reduce the space and computational costs.
This graph can represent different independence assumptions through the directions of the vertices (arrows).

Contitional Independence

We know that when conditional independence holds we can express a joint probability as p(x_1,x_2,x_3)=(x_i_1 | x_i_2,x_i_3,)p(x_i_2|x_i_3)p(x_i_3). This gives six possible permutations, so that simply drawing a graph does not work.

Remember there should not be any cycles in the graph. Therefore we have to drop at least two vertices such that we get 4 possible DAGs from the 6 permutations.
Now, when two vertices point to one and the same node we get a collision:

Fig. 2 a) Collisions and induced dependence
Fig. 2b) Model conditioned on T

In case a collision occurs we get either d-separation or a d-connection. For example (Fig. 2) we see that X and Y are independent of each other and are ’cause’ to the effect Z. We can also write p(X,Y,Z)=p(Z|X, Y)p(X)p(Y).
However if we would condition a model on the collision node (see Fig. 2b), we get p(X,Y|T)\neq p(X|T)p(Y|T) and X and Y are not independent anymore.
A connection is introduced on the right model in Fig. 2a where X and Y are now conditionally dependent given Z.

Belief Networks encode conditional independence well, they do not encode dependence well.

Definition: Two nodes (\mathcal{X} and \mathcal{Y}) are d-separated by \mathcal{Z} in a graph G are if and only if they are not d-connected.
Pretty straight forward, right?

A more applicable explanation is: for every variable x in \mathcal{X} and y in \mathcal{Y}, trace every path U between x and y, if all paths are blocked then two nodes are d-separated. For the definition of blocking and descendants of nodes you can find more in this paper.

Remember: X is conditionally independent of Y if p(X,Y)=p(X)p(Y).

Now away from the theory to something more practical:

How to interact with a model

  1. Analyze the problem and/or set of parameters
    1. set a graphical structure for the parameters in a problem with a DAG
    2. OR reconstruct the problem setting from the parameters
  2. Compile a table of all required conditional probabilities – p(x_i|p_a(x_i))
  3. Set and specify parental relations


Recap Fig. 1 – our short example case

Thomas and Jake are neighbors. When Thomas gets out of the house in the morning he observes that the grass is wet (effect). We want to know how likely it is that Thomas forgot to turn of the sprinkler (S). Knowing that it had rained the day before explains away that the grass is wet in the morning. So let us look at the probabilities:

We have 4 boolean (yes/no) variables (see Fig. 1):
R (rain [0,1]), S (sprinkler [0,1]), T (thomas’s-grass [0,1]), J (jake’s-grass [0,1]). So that we can say that Thomas’s grass is wet, given that Jake’s grass is wet, the sprinkler was off and it rained as p(T=1 | J=1, S=0, R=1).
Already in such small example we have a lot of possible states, namely 2^N=2^4=16 values. We already know that our model has constraints, e.g. the sprinkler is independent of the rain and when it rains both Thomas’s and Jake’s grass is wet.

After taking into account our constraints we can factorize our joint probability into
p(T, J, R, S) = p(T|R,S)p(J|R)p(R)p(S) – we say: “The joint probability is computed by the probability that Thomas’s grass is wet, given rain and sprinkler multiplied with the probability that Jake’s grass is wet, given that it rained and the probability that it rained and the probability that the sprinkler was on.” We have now reduced our problem to 4+2+1+1=8 values. Neat!

We use factorization of joint probabilities to reduce the total number of required states and their (conditional) probabilities.

p(J=1 | R=1)1.0
p(J=1 | R=0)*0.2
p(T=1 | R=1, S=0)1.0
p(T=1 | R=1, S=1)1.0
p(T=1 | R=0, S=1)0.9
p(T=1 | R=0, S=0)0.0
Our conditional probability table (CPT)
* sometimes Jake’s grass is simply wet – we don’t know why…

We are still interested if Thomas left the sprinkler on. Therefore compute the posterior probability p(S=1|T=1) = 0.3382 (see below).

    \begin{align*} p(S=1|T=1) &= \frac{p(S=1,T=1)}{p(T=1)}\\ &=\frac{\sum_{J,R}p(T=1, J, R, S=1)}{\sum_{J,R,S}p(T=1, J, R, S)}\\ &=\frac{\sum_R p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.9*0.8*0.1+1*0.2*0.1}{0.9*0.8*0.1+1*0.2*0.1+0+1*0.2*0.9}\\ &= 0.3382 \end{align*}

When we compute the posterior given that Jake’s grass is also wet we get
p(S=1|T=1, J=1) = 0.1604

    \begin{align*} p(S=1|T=1, J=1) &= \frac{p(S=1,T=1, J=1)}{p(T=1, J=1)}\\ &=\frac{\sum_R p(J=1|R)p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(J=1|R)p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.9*0.8*0.1+1*0.2*0.1}{0.9*0.8*0.1+1*0.2*0.1+0+1*0.2*0.9}\\ &= \frac{0.0344}{0.2144}=0.1604 \end{align*}

We have shown that it is less likely that Thomas sprinkler is on, when Jake’s grass is also wet. Where Jake’s grass is extra evidence that affects the likelihood of our observed effect.


Fig. 3 – Uncertainties and unreliable evidence

In case an observed outcome holds uncertainty or we do not trust the source of our values we can account for that. Lets assume the sprinkler malfunctions sometimes, so that instead of a hard [0,1] we get a vector S=[0.2, 0.8] instead, also called soft evidence. We denote this with dashed nodes (see Fig. 2).
Lets assume Jake’s house is quite far away and there is an added unreliability in the evidence that his grass is wet — maybe Jake is a notorious liar about the wetness of his soil. We denote this with a dotted vertex (see Fig.2).
To solve and to account for uncertainties and unreliability is a whole different topic, which we will address in another post.


Some dependency statements cannot be represented structurally in this way; for example marginalized graphs.
One famous example that BNs hold errors when it comes to causal relations is the Simpson Paradoxon . This can be solved with atomic intervention, which is a topic for yet another post.


  • (Bayesian) Belief Networks represent probabilistic models as well as the factorization of distributions into conditional probabilities
  • BNs are directed acyclic graphs (DAGs)
  • We can reduce the amount of computation and space required by taking into account constraints from the model which are expressed structurally
  • conditional independence is expressed as absence of a link in a network


  1. D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press. USA. 2012: pp.29-51
  2. G. Pipa, Neuroinformatics Lecture Script. 2015: pp.19-26