Account for uncertainties in your programs and pipelines or build probabilistic models.
We should always aim to create better Data Science workflows.
But to find out how to achieve that, we first need to find out what is lacking.
Classical ML workflows are missing something
Classical Machine Learning pipelines work great. The usual workflow looks like this:
- have a use-case or research question
- build and curate a dataset that relates to the use-case or research question
- build a model
- train and validate the model
- maybe even cross-validate while grid-searching hyperparameters
- test the fitted model
- deploy the model for the use-case or answer the research question
As you might have noticed, one severe shortcoming is that this workflow does not account for the uncertainty of the model or for the confidence in its output.
Certain about being Uncertain
After going through this workflow, and given that the model results look sensible, we take the output for granted. So what is missing?
In this basic approach we have not accounted for missing or shifted data.
Some might interject and say that they have some augmentation routine for their data. That’s great – but did you formalize it?
What about building a prototype before having seen the data – something like a modeling sanity check? Simulate some data and build a prototype before you invest resources in gathering data and fitting inadequate models. Andrew Gelman already pointed this out in his keynote at PyData NY 2017.
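A minimal sketch of such a sanity check, using only the Python standard library. The linear model, the "true" parameter values and the helper names `simulate` and `fit_line` are assumptions made up for this illustration; they do not come from the keynote. The idea is to generate data from a model whose parameters you know, fit your prototype, and see whether it recovers them and how much the estimates scatter.

```python
import random

def simulate(n, slope, intercept, noise_sd, rng):
    """Draw n (x, y) pairs from y = intercept + slope * x + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, -1.0  # hypothetical ground truth

# Repeat the simulate-then-fit cycle to see how much the estimates scatter.
slopes = []
for _ in range(200):
    xs, ys = simulate(50, TRUE_SLOPE, TRUE_INTERCEPT, 3.0, rng)
    slope_hat, _ = fit_line(xs, ys)
    slopes.append(slope_hat)

mean_slope = sum(slopes) / len(slopes)
spread = (sum((s - mean_slope) ** 2 for s in slopes) / (len(slopes) - 1)) ** 0.5
print(f"true slope {TRUE_SLOPE}, mean estimate {mean_slope:.2f}, spread {spread:.2f}")
```

If the prototype cannot recover parameters you chose yourself, or the spread is too wide for your use-case at the planned sample size, you learn that before any real data has been collected.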
Get better intuition and parameter insights! For deep-learning models you need to rely on a plethora of tools and plotting libraries to explain what your model has learned.
For probabilistic approaches you can get insights on parameters quickly.
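As a small illustration of what parameter insight can look like, here is a sketch with a Beta-Bernoulli model in plain Python. The observations and the uniform prior are hypothetical choices for this example. Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior over the success probability is available in closed form, and sampling from it gives a credible interval directly.

```python
import random

# Hypothetical data: 1 = success, 0 = failure (not from the original post).
observations = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# A Beta(1, 1) prior is uniform over the success probability.
prior_a, prior_b = 1.0, 1.0
successes = sum(observations)
failures = len(observations) - successes

# Conjugacy: Beta prior + Bernoulli likelihood gives a Beta posterior.
post_a = prior_a + successes
post_b = prior_b + failures
post_mean = post_a / (post_a + post_b)

# Draw from the posterior and read off an approximate 95% credible interval.
rng = random.Random(42)
draws = sorted(rng.betavariate(post_a, post_b) for _ in range(10_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(f"posterior mean {post_mean:.3f}, 95% interval ({lower:.3f}, {upper:.3f})")
```

The output is a whole distribution over the parameter rather than a single fitted number, so the width of the interval tells you how much the data actually constrain it.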
So what tools do we want to use in a production environment?
I. STAN – The Statistician’s Choice
STAN is a well-established framework and tool for research. Strictly speaking, this framework has its own probabilistic language, and the Stan code looks more like a statistical formulation of the model you are fitting.
Once you have built your model and run inference, you can save everything to file, which has the great advantage of making your work reproducible.
STAN is well supported in R through RStan, Python with PyStan, and other interfaces.
In the background the framework compiles the model into efficient C++ code.
In the end, the computation is done through MCMC inference (e.g. the NUTS sampler), which is easily accessible. Variational Inference is supported as well.
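To give a feeling for what MCMC inference does, here is a sketch of a random-walk Metropolis sampler in plain Python. This is a much simpler algorithm than NUTS and is not how STAN works internally; it only illustrates the idea of drawing samples from a posterior. The measurements, the known noise level `SIGMA` and the Normal(0, 10) prior are assumptions for this example.

```python
import math
import random

data = [4.9, 5.6, 5.1, 4.4, 5.8, 5.3, 4.7, 5.2]  # hypothetical measurements
SIGMA = 0.5  # assumed known noise standard deviation

def log_posterior(mu):
    """Unnormalised log posterior: Normal(0, 10) prior, Normal(mu, SIGMA) likelihood."""
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_lik = sum(-0.5 * ((y - mu) / SIGMA) ** 2 for y in data)
    return log_prior + log_lik

def metropolis(n_steps, start, step_sd, rng):
    """Random-walk Metropolis: propose a Gaussian step, accept with prob min(1, ratio)."""
    samples = []
    current = start
    current_lp = log_posterior(current)
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_posterior(proposal)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

rng = random.Random(1)
chain = metropolis(20_000, start=0.0, step_sd=0.4, rng=rng)
kept = chain[2_000:]  # discard warm-up draws taken before the chain settled

post_mean = sum(kept) / len(kept)
post_sd = (sum((m - post_mean) ** 2 for m in kept) / (len(kept) - 1)) ** 0.5
print(f"posterior mean {post_mean:.2f}, posterior sd {post_sd:.2f}")
```

For this conjugate model the exact posterior is known (mean about 5.12, standard deviation about 0.18), which makes it a convenient check that the sampler behaves. In practice you would let a framework handle sampling, tuning and diagnostics for you.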
If you want to get started with this Bayesian approach, we recommend the case studies.
II. Pyro – The Programming Approach
My personal favorite tool for deep probabilistic models is Pyro. This language was originally developed at Uber AI Labs. The framework is backed by PyTorch, so the modeling you do integrates seamlessly with the PyTorch models you might already have.
Writing and training your models reads like any other Python code, with some special rules and formulations that come with the probabilistic approach.
As an overview, we have already compared STAN and Pyro modeling on a small problem set in a previous post: http://laplaceml.com/single-parameter-models/
Pyro excels when you want to find randomly distributed parameters, sample data and perform efficient inference.
Because this language is under constant development, not everything you work with may be documented. Still, there are many use cases, existing model implementations and examples, and the documentation gets better by the day.
The examples and tutorials are a good place to start, especially when you are new to the field of probabilistic programming and statistical modeling.
III. TensorFlow Probability – Google’s Favorite
When you talk Machine Learning, especially deep learning, many people think TensorFlow. Since TensorFlow is backed by Google developers, you can be certain that it is well maintained and has excellent documentation.
When you have TensorFlow, or better yet TF2, in your workflows already, you are all set to use TF Probability as well.
Josh Dillon made an excellent case for why probabilistic modeling is worth the learning curve, and why you should consider TensorFlow Probability, at the TensorFlow Dev Summit 2019.
A short notebook accompanying this post can get you started on writing TensorFlow Probability models.
PyMC3 is an openly available Python API for probabilistic modeling. It is widely used in research, has great community support, and you can find a number of talks on probabilistic modeling on YouTube to get you started.
If you are programming in Julia, take a look at Gen. It is also openly available but in very early stages, so documentation is still lacking and things might break. Even so, it appears to be an exciting framework. If you are happy to experiment, the publications and talks so far have been very promising.
Paul-Christian Bürkner. brms: An R Package for Bayesian Multilevel Models Using Stan.
B. Carpenter, A. Gelman, et al. Stan: A Probabilistic Programming Language.
E. Bingham, J. Chen, et al. Pyro: Deep Universal Probabilistic Programming.