Millennial Suicides – an Analysis about Change

A simple, yet meaningful Pyro model to illustrate change-points over time.

Walking alone. A photo by Mitchel Lensink on Unsplash.

One profound claim and observation made in the media is that the rate of suicides among younger people in the UK has risen from the 1980s to the 2000s. You might find it on the news, in publications, or it may simply be an accepted truth among the population. But how can you make this measurable?
In order to make this claim testable, we look for data and find an overview of the suicide rates, specifically for England and Wales, at the Office for National Statistics (UK), together with an overall visualization.

Making an assumption tangible

Generally, one type of essential question to ask – in everyday life, business or academia – is when changes have occurred. You are under the assumption that something has fundamentally changed over time. In order to prove that, you have to quantify it. So you get some data on the subject matter and build a model that shows you when changes have occurred as well as their magnitude. We are going to look at exactly how to do that.

Millennial Suicides Specifically

Now that we have our data, we take the average for the age-group of millennials. That entails ages in the range of 25 to 35, for the sake of simplicity. Given our dataset, we can visualize the average suicide count for millennials since the 1980s like this:

suicide_millenials_df = suicide_data[suicide_data.Age.isin(range(23,35))].groupby(["Year"]).mean().reset_index()
# round values to closest integer
suicide_millenials_df["Suicides"] = suicide_millenials_df.Suicides.round()
suicide_millenials_df.head()
Year    avg. suicides per 100,000
1981    85.0
1982    83.0
1983    …

What the head of your dataframe should look like.
Fig. 1 – The average suicides (per 100,000 people) in the age group 25 to 35 for England and Wales from 1981 to 2017.

From this data we can clearly see a steady increase over time until the very early 2000s, followed by a decline until the 2010s. The inquisitive side of us wants to know when exactly that happened, so that we can go out and look for the root causes and the underlying issues associated with those trends. Let's build a model to find out exactly in which years to look for that substantial change. To look for a specific change-point we will use the Pyro PPL as a probabilistic framework.

The How

We will use the mathematical underpinnings described in [2]. This entails: a uniform prior distribution over the years T, and a rate of change \mu drawn from a half-normal distribution (the positive side of the Gaussian), selected by Iverson-bracket notation. Looking at the data, we set a large scale for our rate, since we want to capture values roughly between 75 and 130 – so a scale of around 50. Lastly, we model the observed suicide counts through a Poisson regression.
Note: in the Pyro implementation below, the observations are in fact sampled from a normal distribution rather than a Poisson.

T \sim \mathcal{U}(1981,2017), \mu_0\sim\mathcal{N}^+(0, 50), \mu_1\sim\mathcal{N}^+(0, 50), n_t\sim\text{Poisson}(\mu_{[t>T]})

These parameters can be adjusted – maybe you are looking for changes on different scales, either across time or in effect size. One might consider a half-Cauchy distribution for more extreme values, or an increase or decrease of the scale, as sketched below.
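As a minimal sketch of such a tweak (a hypothetical variant, not part of the original analysis), the half-normal prior on the two rates in the model below could be swapped for a heavier-tailed half-Cauchy with the same scale:

import pyro
import pyro.distributions as dist

# hypothetical variant of the rate prior used in the model below
with pyro.plate('rate', 2):
    μ = pyro.sample('μ', dist.HalfCauchy(scale=50))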

The Pyro Model

def model(years, suicides):
    σ = pyro.param('σ', torch.ones(data_len))
    T = pyro.sample('change', dist.Uniform(1981, 2017))
    grp = (years > T) * 1
    # Independent events 
    with pyro.plate('rate', 2):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50))
    y_obs = pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)

The grp variable gives us the index for our bracketed \mu.
The model observations are then sampled from a Normal distribution, taking into account \mu[grp] and a scale of \sigma=1 (last line in the model).

Before we can run this model, we convert our dataframe to Pyro-readable tensors using PyTorch, like so:

years = torch.from_numpy(suicide_millenials_df.Year.values)
suicides = torch.from_numpy(suicide_millenials_df.Suicides.values)
data_len = len(suicide_millenials_df)
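For completeness, the snippets in this post assume imports roughly along these lines (a sketch; adjust to your setup):

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS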

For the fitting procedure, we perform Bayesian inference using MCMC. We draw 1000 samples and let the sampler warm up (burn in) for 300 iterations.

SAMPLES = 1000
WARMUP = 300
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=SAMPLES, warmup_steps=WARMUP)
mcmc.run(years, suicides)

We specifically use the hands-off NUTS sampler to perform inference and find values for our Pyro parameters, and we recommend that you do the same.
Beware that, in contrast to variational inference approaches, MCMC takes its time. After training we find a relatively high acceptance probability of around 91.2% and a step size \epsilon of around 0.00198.
Checking the model summary, we see that there are no divergences – though the R-hat values are higher than we would like.

            mean       std    median      5.0%     95.0%     n_eff     r_hat
    change   1995.40      2.10   1995.40   1991.72   1998.24      3.04      1.94
      μ[0]     96.21     10.41     97.56     93.81    101.23     20.16      1.08
      μ[1]     87.62      2.59     87.90     83.01     90.94      3.03      1.95

Now with the samples at hand, we can display when a change-point has occurred most frequently over time:

Fig. 2 – Histogram of the occurrence of the change-point across years.
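A minimal sketch of how such a histogram can be produced from the posterior samples (the plotting details are assumptions, not the exact notebook code):

import matplotlib.pyplot as plt

samples = mcmc.get_samples()                  # dict of posterior draws keyed by site name
change_years = samples['change'].numpy()      # the sampled change-points
plt.hist(change_years, bins=37, color='red')  # roughly one bin per year, 1981-2017
plt.xlabel('year of change-point')
plt.ylabel('number of samples')
plt.show()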

We go ahead and plot our different rates over the given time:

# df holds one row per posterior draw, with columns 'change', 'rate0' and 'rate1'
# (built from the MCMC samples)
def pl(pt):
    grp = (years > pt['change']) * 1
    plt.plot(years, pt['rate0'] * (1 - grp) + pt['rate1'] * grp,
             color='red', alpha=0.005)

df.apply(pl, axis=1)
Fig. 3 – Change-rate over time (red line) for 1000 samples. There is a change around 1995 to 1996, marking the downward trend.
Fig. 4 – Another rate of change over time, shown as the red line. The change manifests itself over three events between 1986 and 1989. This run took 2000 samples.

First we ran the previously described model with 1000 samples. This uncovered a change-point around the year 1995, indicating the start of the downward trend. The output can be seen in Fig. 3, where the rate is indicated as a red line. Here the model has quantified the change-point for the better development.
We took another shot at running the model, this time with 2000 samples, and it converged on a time-range of '86 to '89 for yet another change-point. This marks the potential turning point towards an upward trend. If we want to uncover the cause behind an increase in suicides among millennials, we have to look for reasons in that time-range, or for earlier changes whose effects have manifested themselves in this period.

Open Questions

We have shown that there is not just one change happening over time. There is a range of points for an increase in the rate and a range of points for a decreasing rate. One might extend the model to pick up more than one change-point, making \mu three-dimensional, or fit an entirely different model – as sketched below.
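A hypothetical sketch of such an extension – two ordered change-points and three rates, mirroring the single-change model above (the function name and details are ours, untested and for illustration only):

def two_change_model(years, suicides):
    σ = pyro.param('σ', torch.ones(len(years)))
    # two change-points; the second is constrained to lie after the first
    T1 = pyro.sample('change1', dist.Uniform(1981, 2017))
    T2 = pyro.sample('change2', dist.Uniform(T1, 2017))
    # index 0 before T1, 1 between T1 and T2, 2 after T2
    grp = (years > T1) * 1 + (years > T2) * 1
    with pyro.plate('rate', 3):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50))
    pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)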

Concluding the Model

We have shown how to get from raw data to a concrete formulation of change over time and its underlying probabilities. This was translated directly into Pyro, and through sampling we could infer concrete change-points. Probabilistic programming is a powerful engine.

The complete notebook can be found here: https://github.com/RMichae1/PyroStudies/blob/master/changepoint_analysis.ipynb

The Gravity on the Subject

Though this was intended as an introductory piece on data science, the topic of self-harm and death is a grave one. In case you or a loved one are affected, there are help and resources out there. So I will leave you with this message:

References

  1. Office for National Statistics, UK. Middle-aged generation most likely to die by suicide and drug poisoning. Dataset on suicide-rates. 2019.
  2. C. Scherrer. Bayesian Changepoint Detection with PyMC3. 2018.
  3. M. Hoffman, A. Gelman. The No-U-Turn Sampler. JMLR v15. 2014.

Improve Your Model with Missing Data | Imputation with NumPyro

Make the best of missing data the Bayesian way. Improve model performance and make comparative benchmarks using Monte Carlo methods.

A Missing frame ready to throw you off your model. Photo by Vilmos Heim on Unsplash.

The Ugly Data

How do you handle missing data, gaps in your data-frames or noisy parameters?
You have spent hours at work, in the lab or in the wild to generate or curate a dataset given an interesting research question or hypothesis. Terribly enough, you find that some of the measurements for a parameter are missing!
Another case that might throw you off is unexpected noise that was introduced at some point in the experiment and has doomed some of your measurements to be extreme outliers.
Sometimes you cannot just exile a whole parameter from your analysis, simply because it has 25% missing entries or maybe entails a “does not apply” field in your questionnaire.

Fear not, there is a way you can make the most of your dataset anyway.
In this post we will use Probabilistic Programming [5] to fill the gaps that your model might face.

Simulate Missing Data

Let's build a problem from scratch. The studious reader is most likely familiar with the Iris flower dataset. It is somewhat part of the "hello world" of the ML community – next to the well-known MNIST, that is.
We use sklearn.datasets to get the data and put it into a pandas dataframe like so:

iris_data = datasets.load_iris() 
iris_df = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']],
                      columns=iris_data['feature_names'] + ['target'])
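The snippet above (and the preprocessing that follows) assumes roughly these imports:

import numpy as np
import pandas as pd
from sklearn import datasets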

To make this example more explicit, let's select only 2 out of the given 3 species:

iris_df["target"] = iris_df.target.astype("int")
iris_df = iris_df[iris_df.target.isin([0, 1])]

Now you should see 100 measurements (rows that are our n) of the Iris flower data.
We now take out 50% of the petal length entries by chance and assign NaN values with a bit of numpy and pandas magic.

random_vec = np.random.random(iris_df["petal length (cm)"].shape) < 0.5 # keep ~50% of the column, marked by a numpy random vector
iris_df["petal length (cm)"] = iris_df["petal length (cm)"].where(random_vec, other=np.nan)

Why do we care about all of our measurements?

When we plot our data, we can clearly see that the petal length and petal width can help us identify which species (target-column) we have (see Fig. 1).

Fig. 1: The relation of petal length and width toward the flower species aka. target-column. We can clearly identify two clusters.

We can make an even clearer distinction by regressing on the groups using just the vector which we have decimated earlier.

Fig. 2: The regression of the petal-length parameter on the identified species (target)

Too bad about half of the data is missing. So what do we do now?

Different Shades of Missing

First off, there are different degrees of missing data:

Missing Completely at Random (MCAR) and Missing at Random (MAR).

For a formal introduction to this see Bayesian Data Analysis [1] Ch. 18

One important distinction is that MCAR treats your missing values as independent of your observed measurements (let's call them y).
In our example we rely on our parameter being missing at random, or MAR. That means we don't think that the chance of a value going missing depends on the parameter itself, BUT we do think that there is potentially some relation between our length measure and the target species.

A Regression Model to Rule them all

First lets extract some basic info from the data of what we have so far:

>>> print("Petal Length μ = {}".format(iris_df["petal length (cm)"].mean()))
>>> Petal Length μ = 2.8072727272727267
>>> print("Petal Length σ^2 = {}".format(iris_df["petal length (cm)"].std()))
>>> Petal Length σ^2 = 1.4923079993698234

Now we build the model to do the imputation. For this we use NumPyro, a lightweight, NumPy-backed counterpart to Pyro – a heavy-duty probabilistic language.
This comes in handy after formulating the model: we will use MCMC methods to fit the model and simulate values for the missing entries.
To make this more applicable to a use case, we combine it with a Bayesian regression model. That means we want to model the distinction between species given all the parameters that we know. Our parameters therefore are:

sepal length := s_len, sepal width := s_width, petal width := p_width and, lastly petal length := p_len, which includes our missing data. 

For each of those parameters we introduce a learnable bias, which comes from a regular Normal distribution (aka a Gaussian) for the sake of simplicity. So let's get started:

def model(s_len, s_width, p_len, p_width, target=None):
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))
    ...

One can do a bit of tweaking of the bias parameters above, but for our initial runs the standard Normal worked reasonably well.
Now the delicate part. We introduce the distribution which we assume underlies our half-destroyed parameter. As with a lot of things, we assume this to be Gaussian, which in our case is a reasonable assumption. Therefore we take loc=\mu (2.8 +/- 0.2) and scale=\sigma (1.5 +/- 0.2), which we had computed earlier:

...    
    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
...

The informed reader clearly sees that \mu and \sigma are themselves probabilistic parameters which we model in the process. For both we assume Gaussians, but more rigorous distributions could be chosen.
Now we have to find the targets which we model. Hence we locate the missing positions in the vector with numpy.

...    
    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]
    
    if target is not None:
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
    else:
        len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], 
                                                              len_sigma[len_nan_idx]))
...

Additionally, we tell NumPyro to make the imputation value (len_impute) a learnable parameter and fill it with as many zeroes as there are missing values.
A short side note: our model has a predictive mode, which kicks in when no target (aka species) is provided. When we invoke the predictive capabilities, we instead sample from the introduced distribution with the sampled \mu and \sigma. Make sure to only apply this to the NaN positions in your vector! That is the meaning of the else clause:

numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], len_sigma[len_nan_idx])

We then put the imputed values into the right places and hand them to the sampling statement, together with the observed length values, for the inference to do its optimization. For this we use the JAX library with some efficient index updating – linear-algebra style:

...    
   p_len = ops.index_update(p_len, len_nan_idx, len_impute)
  
   numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)
...

In our length sampling we include the actual length measures as the observations in the process.

Putting Things Together | Imputed Values into the Linear Model

For this model to be a proper linear model, we have to fit our target y (the species).
Since we have two classes, we do this by invoking a Bernoulli distribution.
We therefore ask in a Bernoulli process whether we are looking at y=0 or y=1 – two classes, two potential outcomes.

The Bayesian Regression – Code

We introduce two parameters: an intercept l and a bias term b_len for the length parameter. From these, together with the other parameters and their biases, we compose the logits:

...    
    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len
    
    logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
...

Now to the magical part. We use the logits for the Bernoulli, which are composed of all the parameters that we put into the model, each interacting with its bias. The observed value in this process is now the target parameter, our y.

   numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

The complete NumPyro imputation model

def model(s_len, s_width, p_len, p_width, target=None):
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))
    
    # impute length
    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]
    
    if target is not None:
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
    else:
        len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], 
                                                              len_sigma[len_nan_idx]))
    p_len = ops.index_update(p_len, len_nan_idx, len_impute)
    
    numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)
    
    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len
    
    logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
    if target is None:
        # prediction case
        probs = expit(logits)
        numpyro.sample("probs", dist.Delta(probs))

    numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

Run It

To fit the model we use Bayesian inference with an MCMC method, the NUTS sampler to be specific. The beauty of the NUTS sampler is that it is a hands-off approach: it adjusts its step sizes by itself as it steps through the problem space. We give it 5000 warm-up (burn-in) runs, which are discarded, and let it run for 10000 sampling iterations, just to be sure. Our inference finishes with an acceptance probability of 0.96, or 96% – neat! We then have a look at the summary of what has happened:

mcmc = MCMC(NUTS(model=model), num_warmup=5000, num_samples=10000, num_chains=1)
# iris_data is assumed here to be a dict mapping the model's argument names
# (s_len, s_width, p_len, p_width, target) to the corresponding dataframe columns
mcmc.run(random.PRNGKey(42), **iris_data)

mcmc.print_summary()
Fig. 3: Model Summary after MCMC indicates convergence and solid R-hat values as metric.

The output clearly shows convergence of the parameters which we sampled in our inference. Specifically, the R-hat values are all close to 1.0.

We then pull the samples and visualize the posterior – we rely on JAX again for the formatting magic:

samples = mcmc.get_samples()
posterior_mu = jnp.expand_dims(samples['l'], -1) + \
               jnp.expand_dims(samples['b_length'], -1)*iris_df["petal length (cm)"].values + \
               jnp.expand_dims(samples['b_s_len'], -1)*iris_df["sepal length (cm)"].values + \
               jnp.expand_dims(samples['b_s_width'], -1)*iris_df["sepal width (cm)"].values + \
               jnp.expand_dims(samples['b_p_width'], -1)*iris_df["petal width (cm)"].values
Fig. 4: Final Bayesian regression over MCMC samples with 90% CI.

There we have it. A clear distinction between our two classes. To be fair there is quite a bit of jitter going on above and below the bounds, but the regression-line could not be nicer.
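A minimal sketch of how a mean curve and a 90% interval like the one in Fig. 4 can be derived from posterior_mu (the exact plotting of the figure may differ; rows where the petal length is still NaN will propagate NaNs, whereas the notebook uses the imputed values):

import jax.numpy as jnp
import matplotlib.pyplot as plt
from jax.scipy.special import expit

probs = expit(posterior_mu)                                # map logits to probabilities
mean_prob = probs.mean(axis=0)
lower, upper = jnp.percentile(probs, jnp.array([5.0, 95.0]), axis=0)

x = jnp.arange(mean_prob.shape[0])                         # plot against the row index
plt.fill_between(x, lower, upper, color='red', alpha=0.3)  # 90% credible band
plt.plot(x, mean_prob, color='darkgreen')
plt.show()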

We have shown that we can fill in the blanks in our model through MCMC-sampling.

For the code-savvy curious reader the final code can be found on my github and in Colab.

What if there are more than two classes? What about categorical data?

Great questions, which can be answered with NumPyro’s toolkit as well:

  1. Instead of having a Bernoulli trial in the model, you can use a Categorical (or Multinomial) distribution for your Bayesian inference – see the sketch after this list.
  2. We can also use categorical data for model fitting. That looks quite different: instead of using a one-dimensional array, we rely on as many dimensions as there are categories.
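As a rough sketch of point 1 (hypothetical code, not the exact notebook), the final sampling statement would swap the Bernoulli for a Categorical over per-class logits:

import numpyro
import numpyro.distributions as dist

def multiclass_head(logits_multi, target=None):
    # logits_multi is a hypothetical array of shape (n_observations, n_classes)
    return numpyro.sample("target", dist.Categorical(logits=logits_multi), obs=target)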

A good example of fitting with categorical instead of continuous data is the excellent Kaggle notebook on age imputation given other parameters of the Titanic dataset:

https://www.kaggle.com/fehiepsi/bayesian-imputation-for-age

The Dos and Don’ts

Please stick to what you have!

Do not invent data. I cannot stress that enough. It is a very fine line that we are walking here.
If your experiments have generated insufficient data, or you had to throw some data out, then imputing values might not be the best way to go – even if your data becomes presentable that way. When talking to a client or an academic journal, you don't get points for "we simulated parameters X and Y to 85.3%", no matter how nice the model looks.
Please do declare what methods, distributions and assumptions you used to build the models and imputations, for maximum transparency.
One recommendation is to also take out the incomplete measurements, fit a model without them, and compare it to the model with imputed values in the end. That can make for a compelling story.

Conclusion

We have walked through data imputation, from the data-frame with holes in it to the final Bayesian inference and fitted posterior distribution.
We have shown how your Bayesian regression can benefit from an imputed vector and what the output looks like including confidence intervals.
Probabilistic Programming can be a powerful tool in your daily work.

Best of luck with your models and analysis!

References

[1] A. Gelman, J.B. Carlin, et al. Bayesian Data Analysis. Third Edition.
[2] R. McElreath. Statistical Rethinking. 2016. CRC Press.
[3] Uber Technologies. Regression using NumPyro. 2019. NumPyro Documentations.
[4] M. Betancourt. A Conceptual Introduction to Hamiltonian Monte Carlo. 2020 on ArXiv.
[5] CS 4110 – Programming Languages and Logics. Probabilistic Programming. 2016. Cornell University.

The Bayesian Toolkit | 3 Probabilistic Frameworks

Account for uncertainties in your programs and pipelines or build probabilistic models.

The tools to build, train and tune your probabilistic models. Photo by Patryk Grądys on Unsplash.

We should always aim to create better Data Science workflows.
But in order to find out how to achieve that we should find out what is lacking.

Classical ML workflows are missing something

Classical machine learning pipelines work great. The usual workflow looks like this:

  1. Have a use-case or research question
  2. build and curate a dataset that relates to the use-case or research question
  3. build a model
  4. train and validate the model
    1. maybe even cross-validate, while grid-searching hyperparameters.
  5. test the fitted model
  6. deploy model for the use-case or answer the research question

As you might have noticed, one severe shortcoming is that this does not account for the uncertainty of the model or the confidence in its output.

Certain about being Uncertain

After going through this workflow, and given that the model results look sensible, we take the output for granted. So what is missing?
In this basic approach we have not accounted for missing or shifted data.
Some might interject and say that they have some augmentation routine for their data. That's great – but did you formalize it?
What about building a prototype before having seen the data – something like a modeling sanity check? Simulate some data and build a prototype before you invest resources in gathering data and fitting insufficient models. This was already pointed out by Andrew Gelman in his keynote at PyData New York 2017.
Get better intuition and parameter insights! For deep-learning models you need to rely on a plethora of tools and plotting libraries to explain what your model has learned.
For probabilistic approaches you can get insights on parameters quickly.
So what tools do we want to use in a production environment?

I. STAN – The Statisticians Choice

STAN is a well-established framework and tool for research. Strictly speaking, this framework has its own probabilistic language, and the Stan code reads more like a statistical formulation of the model you are fitting.
Once you have built and done inference with your model you save everything to file, which brings the great advantage that everything is reproducible.
STAN is well supported in R through RStan, Python with PyStan, and other interfaces.
In the background the framework compiles the model into efficient C++ code.
In the end, the computation is done through MCMC inference (e.g. the NUTS sampler), which is easily accessible, and even variational inference is supported.
If you want to get started with this Bayesian approach we recommend the case-studies.

II. Pyro – The Programming Approach

My personal favorite tool for deep probabilistic models is Pyro. This language was developed and is maintained by the Uber Engineering division. The framework is backed by PyTorch so that the modeling that you are doing integrates seamlessly with the PyTorch models which you might already have.
Writing and training your models reads like any other Python code, with some special rules and formulations that come with the probabilistic approach.

As an overview we have already compared STAN and Pyro Modeling on a small problem-set in a previous post: http://laplaceml.com/single-parameter-models/ .

Pyro excels when you want to find randomly distributed parameters, sample data and perform efficient inference.
As this language is under constant development, not everything you are working on might be documented yet. There are a lot of use-cases, existing model implementations and examples, and the documentation gets better by the day.
The examples and tutorials are a good place to start, especially when you are new to the field of probabilistic Programming and statistical modeling.

III. TensorFlow Probability – Google’s Favorite

When you talk machine learning, especially deep learning, many people think of TensorFlow. Since TensorFlow is backed by Google developers, you can be certain that it is well maintained and has excellent documentation.
When you already have TensorFlow, or better yet TF2, in your workflows, you are all set to use TF Probability as well.
Josh Dillon made an excellent case why probabilistic modeling is worth the learning curve and why you should consider TensorFlow Probability at the Tensorflow Dev Summit 2019:

TensorFlow Probability: Learning with confidence (TF Dev Summit ’19) by TensorFlow Channel

And here is a short Notebook to get you started on writing Tensorflow Probability Models:

https://colab.research.google.com/github/tensorflow/probability/blob/master/tensorflow_probability/g3doc/_index.ipynb

Honorable Mentions

PyMC3 is an openly available Python probabilistic modeling API. It has vast application in research, has great community support, and you can find a number of talks on probabilistic modeling on YouTube to get you started.

If you are programming in Julia, take a look at Gen. It is also openly available and in very early stages, so documentation is still lacking and things might break. Nonetheless, it appears to be an exciting framework, and if you are happy to experiment, the publications and talks so far have been very promising.

References

[1] Paul-Christian Bürkner. brms: An R Package for Bayesian Multilevel Models Using Stan
[2] B. Carpenter, A. Gelman, et al. STAN: A Probabilistic Programming Language
[3] E. Bingham, J. Chen, et al. Pyro: Deep Universal Probabilistic Programming

Single-Parameter Models | Pyro vs. STAN

Photo by Joey Csunyo on Unsplash

Modeling U.S. cancer-death rates with two Bayesian approaches: MCMC in STAN and SVI in Pyro.

Single-parameter models are an excellent way to get started with the topic of probabilistic modeling. These models consist of a single parameter that influences our observations and which we can infer from the given data. In this article we look at the performance of two well-established frameworks and compare them – the statistical language STAN and the Pyro probabilistic programming language.

Kidney Cancer Data

One old and established dataset covers the cases of kidney cancer in the U.S. from 1980-1989 and is available here (see [1]). Given are U.S. counties, their total population and the number of reported cancer deaths.
Our task is to infer the rate of death from the given data in a Bayesian way.
An elaborate walk-through of the task can be found in section 2.8 of “Bayesian Data Analysis 3” [1].
Our dataframe looks like this:

We have the total number of datapoints N (all rows in the dataframe), and per county we have the observed death count (dc), which we refer to as y, and the population (pop), which we call n later.
Given that we have epidemiological data, we think that the Poisson distribution gives a good basis for estimating the rate that we want to compute. Therefore our observations y are sampled from a Poisson distribution. The interesting parameter is the \lambda of our Poisson, which we call rate. This death-rate comes from a Gamma distribution (see the explanations in Chapter 2 of [1]). A very short explanation of why we use a Gamma here is that it is the conjugate prior of the Poisson – though there is more to the special relationship between these distributions.

Putting it together we arrive at the formulation for the observations of death-cases:

y_j \sim \text{Poisson}(n_j\,\theta_j)

and for the wanted single parameter rate:

\theta_j \sim \text{Gamma}(20, 430000)

Now that we have seen the task, let's fire up our tools and get to work.

STAN – the classical statistical models

The statistician's well-established workhorse is the stats language STAN.
You write your problem as model code and STAN compiles an efficient C++ model under the hood. From a compiled model you can sample and perform inference.
The framework offers different interfaces, for example in Python (PyStan), R (RStan), Julia and others. For this article we will use PyStan, so that we can have both models neatly side by side in a notebook.
We start by defining the data that we are given in STAN. This means the integer N for the total size of the dataset, y for the observed deaths and n for the population per county – each of the latter two of length N.

kidney_cancer_code="""
data {
   int N; // total observations
   int y[N]; // observed death-count
   vector[N] n; // population
}
...

We already know that a rate parameter like the death-rate \theta is bounded in [0, 1]. That means you cannot have a rate less than 0 or higher than 1.
This \theta parameter is then the basis for the actual rate, which accounts for a county's population. Thus, since we transform our parameter, we put it into the transformed parameters block:

...
parameters {
   vector<lower=0,upper=1>[N] theta; // bounded deathrate estimate
}

transformed parameters {
   vector[N] rate=n .* theta;
}
...

Now to the magical inference part, which is what the posterior is all about. We sample our \theta from a Gamma distribution. The \alpha (=20) and \beta (=430000) are given in [1], but one could also estimate them from the underlying dataset. Notice that the model actually takes the transformed parameter and not just \theta. This will become apparent in the output later.

...
model {
   theta ~ gamma(20,430000);
   y ~ poisson(rate);
}
"""

Now that we have a STAN Model as a string, we need the data in the right format. That means arrays of integers from the data, like so:

dc = data["dc"].to_numpy().astype(int)
pop = data['pop'].to_numpy().astype(int)
kidney_cancer_dat = {'N': N,
                    'y': dc[0:N],
                    'n': pop[0:N]}

Since everything is set up, we can use PyStan to convert our model to C++ code and use sampling to perform inference. The inference procedure is Markov Chain Monte Carlo, or MCMC. The necessary parameters are the number of samples that we take (the iterations) and the number of chains over which we draw the samples. Each chain is an ordered sequence of draws.

sm = pystan.StanModel(model_code=kidney_cancer_code)
fit = sm.sampling(data=kidney_cancer_dat, iter=5000, chains=4)

After the fitting is done, the first thing to check is whether the model has converged. That means that \hat{R} is close to 1 – or we can just ask the diagnostics tools:

stan_utility.check_all_diagnostics(fit)

### the below lines are output:
n_eff / iter looks reasonable for all parameters
Rhat looks reasonable for all parameters
0.0 of 10000 iterations ended with a divergence (0.0%)
0 of 10000 iterations saturated the maximum tree depth of 10 (0.0%)
E-BFMI indicated no pathological behavior

We can also see that the chains have converged (right plots) from the traces of the performed inference:

az.plot_trace(fit,var_names=['theta'])
Fig. 1 – Traceplot of the \theta posterior computation. The chains on the right side display convergence. Each chain has a density function for theta (left side), and the densities display agreement over the values.

The chains (Fig. 1, right) should look like "caterpillars madly in love", and our diagnostics look good, so we know that our model has converged. We can now look at the inferred theta values and their individual densities:

az.plot_density(fit, var_names=["theta"])
Fig. 2 – Posterior densities on the fitted single parameter \theta after inference.
Warning: Do not use the complete dataset as input! This will lead to errors. We went with N=25 samples for testing purposes. Feel free to test how many samples it takes to break PyStan (or STAN).

For Bayesian regression tasks there are also different modules that let you skip the "write an elaborate statistical model" part. BRMS in R is one of those excellent tools.

Pyro – the probabilistic programming approach

Pyro takes a completely different paradigm: instead of performing MCMC through sampling, we treat our task as an optimization problem.
For this we formulate a model, which computes our posterior (p). Additionally, we formulate a so-called guide, which gives us a parameterized distribution (q) that is used for the model-fitting procedure.
In a previous SVI article we already looked at how this optimization works, and we recommend a short recap.

The highly unintuitive part of this is that instead of simply fitting one parameter, we compose it of four Pyro parameters:

  1. Firstly, \alpha and \beta are constants that give our Gamma distribution its shape, like in the STAN part,
  2. We add two plain trainable parameters p1 and p2, which allow for optimization in the SVI steps. Both parameters are positive – hence constraint=constraints.positive,
  3. When going through the data, the observations themselves are treated as independent events.
    This is done through Pyro's plate modeling, which also supports subsampling from our dataset. A nice introduction to this can be found in [2].
ϵ = 10e-3
def model(population, deathcount):
    α = pyro.param('α', torch.tensor(20.))
    β = pyro.param('β', torch.tensor(430000.))
    p1= pyro.param('p1', torch.ones(data.shape[0]), constraint=constraints.positive)
    p2 = pyro.param('p2', torch.ones(data.shape[0]), constraint=constraints.positive)
    with pyro.plate('data', data.shape[0], subsample_size=32) as idx:
        n_j = population[idx]
        y_j = deathcount[idx]
        α_j = α + p1[idx]*y_j
        β_j = β + p2[idx]*n_j
        θ_j = pyro.sample("θ", Gamma(α_j, β_j))
        λ_j = 10*n_j*θ_j + ϵ
        pyro.sample('obs', Poisson(λ_j), obs=y_j)
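The snippets in this section assume imports roughly along these lines (a sketch; adjust to your Pyro version):

import torch
from tqdm import tqdm

import pyro
from pyro.distributions import Gamma, Poisson
from pyro.infer import SVI, JitTrace_ELBO
from pyro.optim import Adam
from torch.distributions import constraints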

The model performs the sampling of the observations given the Poisson distribution. This is comparable to what STAN was doing, with the difference that the parameters which make up \lambda, and the underlying distributions, are now trainable Pyro parameters.
In order for our model not to go up in flames, we have to add a small number \epsilon to the rate, otherwise the Poisson will be unstable:

λ_j = 10*n_j*θ_j + ϵ
This is not the best way to model this task, but it is the closest to the STAN model. One can find a model and guide that do not rely on y and n for computing \alpha and \beta.

Now the guide gives us a parameterized distribution q, with which we can perform optimization by maximizing the Evidence Lower Bound, or ELBO (the SVI loss is its negative). The parameters are trainable Pyro parameters, which we have already seen in the model.

def guide(population, deathcount):
    α = pyro.param('α', torch.tensor(20.))
    β = pyro.param('β', torch.tensor(430000.))
    p1 = pyro.param('p1', torch.ones(data.shape[0]), constraint=constraints.positive)
    p2 = pyro.param('p2', torch.ones(data.shape[0]), constraint=constraints.positive)
    with pyro.plate('data', data.shape[0], subsample_size=32) as idx:
        n_j = population[idx]
        y_j = deathcount[idx]
        α_j = α + p1[idx]*y_j
        β_j = β + p2[idx]*n_j
        θ_j = pyro.sample("θ", Gamma(α_j, β_j))

Now we can run our SVI like so:

svi = SVI(model, guide, Adam({'lr': 0.025}), JitTrace_ELBO(3))
for i in tqdm(range(1000)):
    loss = svi.step(population, deathcount)
    print(f"Loss: {loss}")

When we plot the loss, we can see that the model improves over time – the lower the loss (the negative ELBO), the better. Fig. 3 shows this loss trace over the iterations.

Fig. 3 – SVI loss over iterations. The loss is the (negative) ELBO returned from the SVI step, over 1000 iterations.

Now for the final check, we can see that we have fitted appropriate parameters.
The \alpha and \beta variables make up a good underlying Gamma distribution for our posterior inference.
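A minimal sketch of that check, reading the fitted values back from Pyro's parameter store (the variable names are ours):

α_fit = pyro.param('α').item()
β_fit = pyro.param('β').item()
print(f"Fitted shape parameters: Gamma(α={α_fit:.2f}, β={β_fit:.2f})")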

Conclusion – What approach is best?

STAN has a straightforward way to reason about the model. If the mathematical formulation of a posterior is known, it can be very straightforward to implement.
However, STAN and its MCMC sampling have their limitations: under the default configuration it was not possible to run our model over all of the data.
Pyro does an excellent job at handling larger datasets efficiently and performing variational inference. As a probabilistic programming language it can be written just like any other Python code. The model plus guide are also informative and encapsulate the problem well. Through this we transform the posterior computation into an optimization task and get sensible outputs.

Now that you are familiar with single-parameter models have fun working with bigger, more complex tasks. The Pyro Examples are a good place to start.

Happy inferring!

References

[1] A. Gelman, J.B. Carlin, et al. Bayesian Data Analysis. Third Edition.
[2] Pyro Documentation – SVI part II

Pyro Top-Down Forecasting | Application-case

Connect the dots over time and forecast with confidence(-intervals).

Connecting dots across time – photo by israel palacio on Unsplash

Have you ever wondered how to account for uncertainties in time-series forecasts?
Have you ever thought there should be a way to generate data from previously seen data and make judgement calls about certainties? I know we have.
If you want to build models that capture probabilities and hold confidences we recommend using a probabilistic programming framework like Pyro.
In a previous article we looked at NGBoost and applied it to the M5 forecasting challenge on Kaggle. As a quick recap: the M5 forecasting challenge asks us to predict how the sales of Walmart items will develop over time. It provides around 4-5 years of data for items of different categories from different stores across different states, and asks us to forecast 28 days for which we have no information. As an overview of the challenge and dataset we still recommend this amazing notebook. Last time we concluded that we got better results with Pyro, and here is a simple walk-through of how we got there.

There are different approaches to modeling and forecasting data over time. There are top-down models, state-space models and hierarchical models – to name a selected few.
In this article we see how a prediction can be done with a very rough and rudimentary top-down model. We make thorough use of the existing Forecaster object in Pyro: we just tell the model how to make an informed prediction, and after that we dump all the data into it. That's it – simple enough! No fancy priors, no fancy underlying distribution assumptions, just data and a probabilistic framework. Some assumptions we still have to make, and we will walk you through them.

Sidenote: One more elegant way to do forecasting is with hierarchical models. These models allow accounting for different distributions over different categories in your data. You can assign individual priors and make use of information that you have about your model. The elegance of such an approach will be covered in another post.

In the end, we dump all the time-series information that we have into the model – which is sales over time – while treating each sale as an independent random event. We don't use anything else. If the Pyro programming framework, specifically the ForecastingModel, is the engine of our model, we also need a driver.
The element that drives our model in the right direction is this:

prediction = regressor + trend + seasonal + motion + bias

This simple line combines Pyro objects from which we sample over time. Now, let's look at what all those parts mean:

Composing the model

The actual model part is captured by the regressor. This piece looks for proper model weights and multiplies them with the features, which come directly from the data input. Each weight lies on a classical Gaussian curve: we sample it from a normal distribution centered around zero with a standard deviation of 1 (\mu=0, \sigma^2=1).

 weight = pyro.sample("weight", dist.Normal(0, 1).expand(
        [feature.size(-1)]).to_event(1))
 regressor = (weight * feature).sum(-1)

The second parameter is the trend. We consult our probabilistic looking-glass and decide that it should come from a log-normal distribution, such that the trend coefficient l \sim \mathcal{LN}(-2, 1), i.e. with \mu=-2 and \sigma^2=1. A key element is that we account for the time feature that we provide to the model.

trend_coef = pyro.sample("trend", dist.LogNormal(-2, 1))
trend = trend_coef * time

Now what is a trend without a season? Trends come and go, but the seasons are something we can count on. We know that the underlying data describes purchases over time, and we think that a change in sales is captured by which day of the week it is. Therefore we write:

with pyro.plate("day_of_week", 7, dim=-1):
    seasonal = pyro.sample("seasonal", dist.Normal(0, 5))
seasonal = periodic_repeat(seasonal, duration, dim=-1)

Here we make use of another neat Pyro object: the plate. This construct is used to model independence of events, and by event we mean the observation of an instance occurring at each time-step of our data. A thorough explanation can be found in the Pyro SVI tutorials. For our purposes we break it down and say that we think the data rises and falls within a seven-day window, which we translate into independently sampled events. Last but not least, there is not just one seven-day interval – we periodically repeat it over our entire time-input, i.e. our duration is the size of our time tensor.

Lastly, every professional ML model needs some trainable parameter that accounts for the fact that sometimes things are just not as nice as you expect them to be – we call this a bias. We decide that our bias b comes from a normal distribution, such that b \sim \mathcal{N}(0, 10), with \mu=0 and \sigma=10, or loc=0 and scale=10 if you want to talk Pyro.

bias = pyro.sample("bias", dist.Normal(0, 10))

Let’s take a look at what the distributions of our parameters – before we continue.

Fig. 1 – Density Plots over all sampled parameters by color. Y-axis is clipped and the max for the trend log-normal distribution is >2.5 .

You, the attentive reader, might conclude at this point that you have everything you need – and we agree. What we have now is sufficient to give us a good estimate and make projections for a couple of days into the future.
We can make it better though. The prediction so far, all by itself, is a really stiff expression. The model is told to fit the input, but we don't allow for any leeway.
Let's assume, for the sake of argument, that all people in California decide they would rather go shopping on a Thursday afternoon instead of a Saturday; or a crisis occurs and everybody goes to prepare for the apocalypse. Our model might have correctly predicted all previous Saturdays, but that one Thursday really throws it off. To account for that, we introduce something called reparameterization. For an extensive overview of this topic see [2].
To make it even more robust, we do a reparameterization through another reparameterization – why we do this is explained here.
Packing everything together, we get a drift which defines our motion – the motion term in the prediction above. This translates to Pyro as such:

drift_stability = pyro.sample("drift_stability", dist.Uniform(1, 2))
drift_scale = pyro.sample("drift_scale", dist.LogNormal(-17, 5))
with self.time_plate:
   with poutine.reparam(config={"drift": LocScaleReparam()}):
      with poutine.reparam(config={"drift": SymmetricStableReparam()}):
         drift = pyro.sample("drift", dist.Stable(drift_stability, 0, drift_scale))
motion = drift.cumsum(dim=-1)

Now we have all we need for our model to work.

Your model at work

For the actual model-fitting part we use Pyro's Stochastic Variational Inference engine, or SVI. This algorithm allows us to efficiently approximate our posterior distribution in a reasonable amount of time. In our case this is done by maximizing the Evidence Lower Bound, or ELBO. One simplified way to think about the ELBO is to compare two distributions: the approximating one and the true posterior. If they are wildly different, we get a low ELBO value; if the fitted distribution aligns well with the posterior, we get a high ELBO value. Some might even say that the divergence between the two distributions is minimized. To read about it in detail we refer to the original paper on SVI (see [3]). For now let's assume this just works because smart researchers have implemented it sufficiently.
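For reference, the quantity being maximized is

\text{ELBO}(q) = \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)],

which differs from the log evidence \log p(x) exactly by the KL divergence \text{KL}(q(z)\,\|\,p(z|x)) – so maximizing the ELBO indeed minimizes that divergence.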

We will go into the interplay of Stochastic Variational Inference and ELBO (or minimizing KL) in another article. 

Now that we have put together a working model and selected our algorithm to compute our distributions, we have to set the hyperparameters, load the data and start training. For data handling we rely heavily on the M5 starter kit implementation found on GitHub, which the Pyro PPL team generously provides. This is an easy way to get the aggregates over e.g. the sales data. Feel free to implement this yourself – it's a sure way to learn your way around tensors. For our beginner model we only care about the sales data over time.
We now set our parameters and instantiate the forecaster as follows:

forecaster_opt = {
    "learning_rate": 0.1,
    "learning_rate_decay": 0.1,
    "clip_norm": 10,
    "num_steps": 3501,
    "log_every": 100,
}
forecaster = Forecaster(TopDownModel(), data, covariates[:-28], **forecaster_opt)
samples = forecaster(data, covariates, num_samples=1000).exp().squeeze(-1).cpu()
pred = samples.mean(0)

In the above code we fit the forecaster, which under the hood uses a DCTAdam optimizer. We provide the covariates for training, withholding the last 28 days, which we use for the actual forecast in the line below. The prediction is then the mean across the samples, which will be used for our Kaggle submission. But let's not get ahead of ourselves here. When we run the above code and have implemented our model correctly, we should get the following output – the actual training:

Forecaster training process with the specified model over all the training data for 3501 steps. The loss is minimized over time.

We see that our loss decreases. This is good and tells us that the distributions computed by our prediction get better over time. The loss here is the negative ELBO divided by the amount of data, as noted in the Forecaster documentation (see [4]).

The Real Uncertain Thing – with confidence and everything

We now hold a final model in our memory. We can compute losses with it, throw it against new data, poke it and see what it generates.
For example, we can sample sales over time as you can see in the image below:

Fig. 2 – Real data and data sampled from the fitted model over time. The last available (truth) 14 days are depicted by the red line, whereas the sampled-predictions by the model are light green. We can now forecast the required 28 days with the mean (dark-green) that serves as prediction. A confidence interval from 10%-90% for each time-step is displayed with red shading.

To compute the confidence at the individual steps, we ask our model for 100 samples and compute quantiles (0.1 and 0.9) over those samples. This is the red area that you can see in Fig. 2. The sales in the figure are aggregated, logged values for better model fitting. To get values in a more realistic range, transform them back using numpy's exp.
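A minimal sketch of that quantile computation (assuming samples is the [num_samples, duration] tensor returned by the forecaster earlier):

import torch
import matplotlib.pyplot as plt

quantiles = torch.quantile(samples, torch.tensor([0.1, 0.9], dtype=samples.dtype), dim=0)
low, high = quantiles[0], quantiles[1]
mean_pred = samples.mean(0)

t = torch.arange(samples.size(-1))
plt.fill_between(t, low, high, color='red', alpha=0.3)  # 10%-90% band
plt.plot(t, mean_pred, color='darkgreen')               # mean forecast
plt.show()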

To Summarize

We have shown how you can build a model, using the facts you know about the problem to set the model parameters and find appropriate distributions. A lot of functionality comes with Pyro, such as SVI, the algorithm to approximate the posterior that makes the model training possible in the first place. After we have fitted our model, we can sample from it, forecast datapoints, and compute confidences for our predictions.

Complete Model Code

class TopDownModel(ForecastingModel):
  """
  Top-Down Hierarchical Forecasting Model
  """
  def model(self, zero_data, covariates):
    # check univariate data
    assert zero_data.size(-1) == 1 
    duration = zero_data.size(-2)

    time, feature = covariates[..., 0], covariates[..., 1:]

    bias = pyro.sample("bias", dist.Normal(0, 10))
    trend_coef = pyro.sample("trend", dist.LogNormal(-2, 1))
    trend = trend_coef * time

    weight = pyro.sample("weight", dist.Normal(0, 1).expand(
        [feature.size(-1)]).to_event(1))
    regressor = (weight * feature).sum(-1)

    # weekly seasonality as independent events
    with pyro.plate("day_of_week", 7, dim=-1):
      seasonal = pyro.sample("seasonal", dist.Normal(0, 5))
    seasonal = periodic_repeat(seasonal, duration, dim=-1)

    drift_stability = pyro.sample("drift_stability", dist.Uniform(1, 2))
    drift_scale = pyro.sample("drift_scale", dist.LogNormal(-17, 5))

    # introduce drift
    with self.time_plate:
          # We combine two different reparameterizers: the inner SymmetricStableReparam
          # is needed for the Stable site, and the outer LocScaleReparam improves inference.
          with poutine.reparam(config={"drift": LocScaleReparam()}):
              with poutine.reparam(config={"drift": SymmetricStableReparam()}):
                  drift = pyro.sample("drift",
                                      dist.Stable(drift_stability, 0, drift_scale))
    motion = drift.cumsum(dim=-1)

    # predict
    prediction = regressor + trend + seasonal + motion + bias
    # Pyro Forecast is multivariate - univariate timeseries is needed
    prediction = prediction.unsqueeze(-1)

    # heavy-tailed noise to account for outliers
    stability = pyro.sample("noise_stability", dist.Uniform(1, 2).expand([1]).to_event(1))
    skew = pyro.sample("noise_skew", dist.Uniform(-1, 1).expand([1]).to_event(1))
    scale = pyro.sample("noise_scale", dist.LogNormal(-5, 5).expand([1]).to_event(1))
    noise_dist = dist.Stable(stability, skew, scale)
    with poutine.reparam(config={"residual": StableReparam()}):
      self.predict(noise_dist, prediction)

Disclaimer

A big thanks to the Pyro documentation and the development team. The comprehensive writing on the topics and the implementations on GitHub really enable Pyro users to hit the ground running. Our model was a wild, experimental mash-up of the different models available, and we tested which parts work well together.


References

  1. Pyro Documentation – Tutorial on Forecasting I , II and III
  2. M. Gorinova, D. Moore, M. Hoffman. Automatic Reparameterisation of Probabilistic Programs. 2019 published on ArXiv
  3. M. Hoffman, et al. Stochastic Variational Inference. 2013. JMLR.
  4. Uber Technologies. 2018. Pyro Forecaster see Documentation