Pyro Top-Down Forecasting | Application-case

Connect the dots over time and forecast with confidence(-intervals).

Connecting dots across time – photo by israel palacio on Unsplash

Have you ever wondered how to account for uncertainties in time-series forecasts?
Have you ever thought there should be a way to generate data from previously seen data and make judgement calls about certainties? I know we have.
If you want to build models that capture probabilities and hold confidences we recommend using a probabilistic programming framework like Pyro.
In a previous article we have looked at NGBoosting and have applied it to the M5 forecasting challenge on Kaggle. As a quick recap – the M5 forecasting challenge asks us to predict how the sales of Walmart items will develop over time. It provides around 4-5 years of data for items of different categories from different stores across different states and asks us to forecast 28 days that we have no information about. As an overview over the challenge and data-set we still recommend this amazing notebook. Last time we concluded, that we got better results with Pyro and here is a simple walk-through of how we got there.

There are different approaches to modeling and forecasting data over time. There are top-down models, state-space models and hierarchical models – to name a selected few.
In this article we see how a prediction can be done through a very rough and rudimentary top-down model. We make thorough use of the existing Forecaster object in Pyro. We just tell the model how to make an informed prediction. After that we dump all the data into it. That’s it – simple enough! No fancy priors, no fancy underlying distribution-assumptions, just data and a probabilistic framework. Some assumptions we still have to make and we will walk you through this.

Sidenote: One more elegant way to do forecasting is with hierarchical models. These models allow accounting for different distributions over different categories in your data. You can assign individual priors and make use of information that you have about your model. The elegance of such an approach will be covered in another post.

In the end, we dump all our time-series information that we have into the model, which is sales over time, while preserving a sale as an independent random event . We don’t use anything else. If the Pyro programming framework, specifically the ForecastingModel is the engine to our model we also need a driver.
The element that drives our model in the right direction is this:

prediction = regressor + trend + seasonal + motion + bias

This simple line combines Pyro objects from which we sample over time. Now, lets look at what all those parts mean:

Composing the model

The actual model part is captured by the regressor. This piece looks for proper model-weights and throws them against the features which come directly from the data-input. Each weight lies on a classical Gaussian curve. This means we sample from a normal distribution centered around zero with a standard deviation of 1 (\mu=0,\sigma^2=1).

 weight = pyro.sample("weight", dist.Normal(0, 1).expand(
 regressor = (weight * feature).sum(-1)

The second parameter is the trend. We consult our probabilisitc looking-glass and then decide it should come from a Log-Normal distribution, such that l\sim\mathcal{LN} with \mu=-2 and \sigma^2=1. A key element is that we account for the time-feature that we provide the model with.

trend_coef = pyro.sample("trend", dist.LogNormal(-2, 1))
trend = trend_coef * time

Now what is a trend without a season? Trends come and got, but the seasons are what we can count on. We know that the underlying data describes purchases over time. We think that a change in sales is captured by what day it is during the week. Therefore we say

with pyro.plate("day_of_week", 7, dim=-1):
    seasonal = pyro.sample("seasonal", dist.Normal(0, 5))
seasonal = periodic_repeat(seasonal, duration, dim=-1)

Here we make use of another neat Pyro object: the plate. This construct is used to model independence of events and by event we mean to make an observation of an instance occurring. This observation comes from each time-step of our data. A thorough explanation can be found in the Pyro SVI tutorials. For our purposes we break it down and say that we think that the data rises and falls within a seven day window. We translate this to independently sampled events. Last but not least there is not just one seven day interval, but we periodically repeat this over all of our given time-input aka. our duration is the size of our time-tensor.

Lastly, every professional ML-model needs some trainable parameter that accounts for the fact that sometimes things are just not as nice as you expect them to be – we call this a bias. We decide for our bias b that it comes from a normal distribution, such that b\sim\mathcal{N} where \mu=0 and \sigma^2=10 or loc=0 and scale=10 if you want to talk Pyro.

bias = pyro.sample("bias", dist.Normal(0, 10))

Let’s take a look at what the distributions of our parameters – before we continue.

Fig. 1 – Density Plots over all sampled parameters by color. Y-axis is clipped and the max for the trend log-normal distribution is >2.5 .

You, the attentive reader, might conclude at this point. You have everything you need and we agree. What we have now is sufficient to give us a good estimate and make projections for a couple of days into the future.
We can make it better though. The prediction so far, all by itself is a really stiff expression. The model is told to fit the input but we don’t allow for leeway.
Let’s assume, for the sake of argument, that all people in California decide they rather go shopping on a Thursday afternoon instead of a Saturday; or a crisis occurs and everybody goes to prepare for the apocalypse. Our model might have correctly predicted all previous Saturdays, but that one Thursday really throws it off. To account for that we introduce something called reparameterization. For an extensive overview of this topic see [2].
To make it even more robust we do a reparameterization through another reparameterization – why we do this is explained here.
Packing everything together we get a drift which defined our motion. And this motion is the “m” parameter in our model. This translates to Pyro as such:

drift_stability = pyro.sample("drift_stability", dist.Uniform(1, 2))
drift_scale = pyro.sample("drift_scale", dist.LogNormal(-17, 5))
with self.time_plate:
   with poutine.reparam(config={"drift": LocScaleReparam()}):
      with poutine.reparam(config={"drift": SymmetricStableReparam()}):
         drift = pyro.sample("drift", dist.Stable(drift_stability, 0, drift_scale))
motion = drift.cumsum(dim=-1)

Now we have all we need for our model to work.

Your model at work

For the actual model-fitting part we use Pyro’s Stochastic Variational Inference engine or SVI . This algorithm allows us to efficiently compute our posterior distribution in a reasonable amount of time. In our case this is done to maximize the Evidence Lower Bound or ELBO. One simplified way to think about ELBO is to observe two different distributions, a current one and a previous one. If the distributions are wildly different we get a low ELBO value. On the other hand if I have fitted a good distribution my next one might be just as nice, aligns well with the already observed one and I get a high ELBO value. Some might even say that the divergence of both models is minimized . To read about it in Detail we refer to the original paper for SVI (see [3]). For now lets assume this just works because smart researchers have implemented it sufficiently.

We will go into the interplay of Stochastic Variational Inference and ELBO (or minimizing KL) in another article. 

Now that we have put together a working model, have selected our algorithm to compute our distributions, we have to set the hyperparameters, load the data and start training. For data-handling we extensively attend to the M5 Starter Kit implementation found on github, which the Pyro PPL team provides generously. This is an easy way to get the aggregates over e.g. the sales data. Feel free to implement this yourself. Its a sure way to learn your way around tensors. For our beginner model we only care about the sales data over time.
We now set our parameters and instantiate the forecaster as follows:

forecaster_opt = {
    "learning_rate": 0.1,
    "learning_rate_decay": 0.1,
    "clip_norm": 10,
    "num_steps": 3501,
    "log_every": 100,
forecaster = Forecaster(TopDownModel(), data, covariates[:-28], **forecaster_opt)
samples = forecaster(data, covariates, num_samples=1000).exp().squeeze(-1).cpu()
pred = samples.mean(0)

In the above code we fit the forecaster which under the hood uses a DCTAdam optimizer. We provide the covariates for the training and withhold 28 days, which we use for the actual forecast in the line below. The actual prediction is then the mean across the samples, which will be used for our kaggle submission. Lets not get ahead of ourselves here. When we run the above codes and have implemented our model correctly we should get the following output – the actual training:

Forecaster training process with the specified model over all the training data for 3501 steps. The loss is minimized over time.

We see that our loss decreases. This is good and tells us that the computed distributions from our prediction get better over time. The loss here is a quotient of the negative ELBO with respect to the data and part of the Forecaster documentation (see [4]).

The Real Uncertain Thing – with confidence and everything

We now hold a final model in our memory. We can compute losses with it, throw it against new data, poke it and see what it generates.
For example, we can sample sales over time as you can see in the image below:

Fig. 2 – Real data and data sampled from the fitted model over time. The last available (truth) 14 days are depicted by the red line, whereas the sampled-predictions by the model are light green. We can now forecast the required 28 days with the mean (dark-green) that serves as prediction. A confidence interval from 10%-90% for each time-step is displayed with red shading.

To compute the confidence of the individual steps we ask our model for 100 samples and compute quantiles (0.1 and 0.9) for those samples. This is the red area that you can see in Fig. 2 . The sales in the figure are aggregated logged-values for better model-fitting. To get values in a more realistic range transform the values using numpy’s exp.

To Summarize

We have shown how you can build a model. Use the facts that you know about the problem to set the model parameters and find appropriate distributions. A lot of functionality comes with Pyro, such as SVI an algorithm to compute the posterior, which makes the model-training possible in the first-place. After we have fitted our model we can sample from it, forecast datapoints and compute confidences on our predictions.

Complete Model Code

class TopDownModel(ForecastingModel):
  Top-Down Hierarchical Forecasting Model
  def model(self, zero_data, covariates):
    # check univariate data
    assert zero_data.size(-1) == 1 
    duration = zero_data.size(-2)

    time, feature = covariates[..., 0], covariates[..., 1:]

    bias = pyro.sample("bias", dist.Normal(0, 10))
    trend_coef = pyro.sample("trend", dist.LogNormal(-2, 1))
    trend = trend_coef * time

    weight = pyro.sample("weight", dist.Normal(0, 1).expand(
    regressor = (weight * feature).sum(-1)

    # weekly seasonality as independent events
    with pyro.plate("day_of_week", 7, dim=-1):
      seasonal = pyro.sample("seasonal", dist.Normal(0, 5))
    seasonal = periodic_repeat(seasonal, duration, dim=-1)

    drift_stability = pyro.sample("drift_stability", dist.Uniform(1, 2))
    drift_scale = pyro.sample("drift_scale", dist.LogNormal(-17, 5))

    # introduce drift
    with self.time_plate:
          # We combine two different reparameterizers: the inner SymmetricStableReparam
          # is needed for the Stable site, and the outer LocScaleReparam improves inference.
          with poutine.reparam(config={"drift": LocScaleReparam()}):
              with poutine.reparam(config={"drift": SymmetricStableReparam()}):
                  drift = pyro.sample("drift",
                                      dist.Stable(drift_stability, 0, drift_scale))
    motion = drift.cumsum(dim=-1)

    # predict
    prediction = regressor + trend + seasonal + motion + bias
    # Pyro Forecast is multivariate - univariate timeseries is needed
    prediction = prediction.unsqueeze(-1)

    # heavy tail nose to account for outliers
    stability = pyro.sample("noise_stability", dist.Uniform(1, 2).expand([1]).to_event(1))
    skew = pyro.sample("noise_skew", dist.Uniform(-1, 1).expand([1]).to_event(1))
    scale = pyro.sample("noise_scale", dist.LogNormal(-5, 5).expand([1]).to_event(1))
    noise_dist = dist.Stable(stability, skew, scale)
    with poutine.reparam(config={"residual": StableReparam()}):
      self.predict(noise_dist, prediction)


A big thanks to the Pyro documentation and the development team. The comprehensive writing on the topics and the implementations on GitHub really enable Pyro users to hit the ground running. Our model was a wild experimental mix-up of different models available and we tested what parts work well together.


  1. Pyro Documentation – Tutorial on Forecasting I , II and III
  2. M. Gorinova, D. Moore, M. Hoffman. Automatic Reparameterisation of Probabilistic Programs. 2019 published on ArXiv
  3. M.Hoffman, et al. in Stochastic Variational Inference. 2013. in JMLR
  4. Uber Technologies. 2018. Pyro Forecaster see Documentation

NGBoost incremental Forecasting

or how to predict something you don’t know with confidence-intervals.

Currently there is a prominent forecasting challenge happening on Kaggle, the M5 Forecasting Challenge . Given are the sales of product-items over a long time range (1913 days or around 5 years), calendar information and price-info. An excellent in-depth analysis by M. Henze can be found here. The goal of the challenge is, to make informed predictions on how the sales for the individual items will continue over the upcoming 28 days.
One can approach this challenge with different methods: LSTMs, MCMC methods and others come to mind. This article will focus on how to generate predictions through a relatively novel approach – NGBoosting.

NGBoost was developed by the Stanford ML group and was published by Tony Duan, Anand Avati and others on arXiv in 2019. This approach optimizes boosting by utilizing natural gradients in contrast to other boosting methods, like LightGBM or XGBoost. It reduces overfitting on known data-points and reaches at least state-of the art performance on standard benchmarking datasets – see [1] for reference and insights. Additionally, after fitting a model, we can generate PDFs (probability density functions) over the data (see below) and one can even get uncertainty intervals.

Making predictions with Boosting algorithms is not as straight-forward as with LSTMs or MCMC methods. An LSTM can generate new data over specified time-steps after the model has learned the underlying structure of the input data. If you fit a boosting model and simply hit the “predict 28 days”-button, it would spit out a consistent value 28 times.
So lets get the party started in your favorite editor or notebook:

!pip install ngboost
from ngboost import NGBRegressor

df = # load the dataset as provided on kaggle ...
X, y = # split the df into X and y as described below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ngb_clf = NGBRegressor(n_estimators=100, Score=LogScore, minibatch_frac=0.5), y_train)


Now that we have predicted something we want to look into the future – one look at a time. The approach to get adjusted predictions, is to take predicted data into account. That means we refit the model and make the next prediction incrementally. You can find a possible algorithm for this in Fig. 3. Lets say we have our input data in the format below (see Fig. 1). What we need in addition is an empty column for our next day (e.g. day 1914).

Fig. 1 – sales dataframe original – we have a row per item with 6 item-features (id, item_id, .., state_id) and 1913 days. In its original size we have 30490 items.

We then wrangle with our data (you might use python’s pandas, the R-tidyverse or maybe you like Excel) and come up with the beautiful dataframe below (see Fig. 2), that means we have: (1.) the sales as our y-value or what we want to fit and predict, (2.) one-hot-encoded the categories that we are interested in, e.g. time-features or price-features, etc. and (3.) label-encoded the item_id (this is optional, but helps to assign the predictions later). Now we follow the pseudo-code below:

Fig. 2 Shape of the input data categories are one-hot encoded and days are depicted as d-column.
  • We split the dataframe in the days that we know (e.g. 1913+i where i is the timestep we are working with) lets call this X_train and the new data that we don’t know, called X_pred
  • We train the regressor on the data that we have -> X_train.
  • We call the predict function of the fitted regressor on X_pred
  • We repeat the process until we have all 28 days
Fig. 3 – possible approach to incrementally generate predictions

The user-guide to what ngboost offers and how that works can be found on the Stanford ML groups site.
The complete code to our solution and a notebook will be provided through a github-link, once the challenge is concluded in June 2020.

Happy Coding!


  1. Tony Duan, Anand Avati, et. al. NGBoost: Natural Gradient Boosting for Probabilistic Prediction. v3. 2020 available on arXiv
  2. The NGBoost User Guide
  3. The M5 Kaggle Challenge

Belief Networks

or Bayesian Belief Networks combine probability and graph theory to represent probabilistic models in a structured way.

Read: 12 min

  • translate graphical models to do inference
  • read and interpret graphical models
  • identify dependencies and causal links
  • work through an example


Belief Networks (BN) combine probabilities and their graphical representation to show causal relationships and assumptions about the parameters – e.g. independence between parameters. This is done through nodes and directed links (vertices) between the nodes to form a Directed Acyclic Graph (DAG). This also allows to display operations like conditioning on parameters or marginalizing over parameters.

A DAG is a graph with directed vertices
that does not contain any cycles. 
Fig 1. A basic representation of a Belief Network as a graph.

Use and Properties

In Fig.1 we can see a model with four parameters lets say rain (R), Jake (J), Thomas (T) and sprinkler (S). We observe that R has an effect on our parameters T and J. We can make a distinction between cause and effect – e.g. the rain is causal to Jake being wet.
Further we can model or observe constraints that our model holds, such that we do not have to account for all combination of possible parameters (2^N) and can reduce the space and computational costs.
This graph can represent different independence assumptions through the directions of the vertices (arrows).

Contitional Independence

We know that when conditional independence holds we can express a joint probability as p(x_1,x_2,x_3)=(x_i_1 | x_i_2,x_i_3,)p(x_i_2|x_i_3)p(x_i_3). This gives six possible permutations, so that simply drawing a graph does not work.

Remember there should not be any cycles in the graph. Therefore we have to drop at least two vertices such that we get 4 possible DAGs from the 6 permutations.
Now, when two vertices point to one and the same node we get a collision:

Fig. 2 a) Collisions and induced dependence
Fig. 2b) Model conditioned on T

In case a collision occurs we get either d-separation or a d-connection. For example (Fig. 2) we see that X and Y are independent of each other and are ’cause’ to the effect Z. We can also write p(X,Y,Z)=p(Z|X, Y)p(X)p(Y).
However if we would condition a model on the collision node (see Fig. 2b), we get p(X,Y|T)\neq p(X|T)p(Y|T) and X and Y are not independent anymore.
A connection is introduced on the right model in Fig. 2a where X and Y are now conditionally dependent given Z.

Belief Networks encode conditional independence well, they do not encode dependence well.

Definition: Two nodes (\mathcal{X} and \mathcal{Y}) are d-separated by \mathcal{Z} in a graph G are if and only if they are not d-connected.
Pretty straight forward, right?

A more applicable explanation is: for every variable x in \mathcal{X} and y in \mathcal{Y}, trace every path U between x and y, if all paths are blocked then two nodes are d-separated. For the definition of blocking and descendants of nodes you can find more in this paper.

Remember: X is conditionally independent of Y if p(X,Y)=p(X)p(Y).

Now away from the theory to something more practical:

How to interact with a model

  1. Analyze the problem and/or set of parameters
    1. set a graphical structure for the parameters in a problem with a DAG
    2. OR reconstruct the problem setting from the parameters
  2. Compile a table of all required conditional probabilities – p(x_i|p_a(x_i))
  3. Set and specify parental relations


Recap Fig. 1 – our short example case

Thomas and Jake are neighbors. When Thomas gets out of the house in the morning he observes that the grass is wet (effect). We want to know how likely it is that Thomas forgot to turn of the sprinkler (S). Knowing that it had rained the day before explains away that the grass is wet in the morning. So let us look at the probabilities:

We have 4 boolean (yes/no) variables (see Fig. 1):
R (rain [0,1]), S (sprinkler [0,1]), T (thomas’s-grass [0,1]), J (jake’s-grass [0,1]). So that we can say that Thomas’s grass is wet, given that Jake’s grass is wet, the sprinkler was off and it rained as p(T=1 | J=1, S=0, R=1).
Already in such small example we have a lot of possible states, namely 2^N=2^4=16 values. We already know that our model has constraints, e.g. the sprinkler is independent of the rain and when it rains both Thomas’s and Jake’s grass is wet.

After taking into account our constraints we can factorize our joint probability into
p(T, J, R, S) = p(T|R,S)p(J|R)p(R)p(S) – we say: “The joint probability is computed by the probability that Thomas’s grass is wet, given rain and sprinkler multiplied with the probability that Jake’s grass is wet, given that it rained and the probability that it rained and the probability that the sprinkler was on.” We have now reduced our problem to 4+2+1+1=8 values. Neat!

We use factorization of joint probabilities to reduce the total number of required states and their (conditional) probabilities.

p(J=1 | R=1)1.0
p(J=1 | R=0)*0.2
p(T=1 | R=1, S=0)1.0
p(T=1 | R=1, S=1)1.0
p(T=1 | R=0, S=1)0.9
p(T=1 | R=0, S=0)0.0
Our conditional probability table (CPT)
* sometimes Jake’s grass is simply wet – we don’t know why…

We are still interested if Thomas left the sprinkler on. Therefore compute the posterior probability p(S=1|T=1) = 0.3382 (see below).

    \begin{align*} p(S=1|T=1) &= \frac{p(S=1,T=1)}{p(T=1)}\\ &=\frac{\sum_{J,R}p(T=1, J, R, S=1)}{\sum_{J,R,S}p(T=1, J, R, S)}\\ &=\frac{\sum_R p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.9*0.8*0.1+1*0.2*0.1}{0.9*0.8*0.1+1*0.2*0.1+0+1*0.2*0.9}\\ &= 0.3382 \end{align*}

When we compute the posterior given that Jake’s grass is also wet we get
p(S=1|T=1, J=1) = 0.1604

    \begin{align*} p(S=1|T=1, J=1) &= \frac{p(S=1,T=1, J=1)}{p(T=1, J=1)}\\ &=\frac{\sum_R p(J=1|R)p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(J=1|R)p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.9*0.8*0.1+1*0.2*0.1}{0.9*0.8*0.1+1*0.2*0.1+0+1*0.2*0.9}\\ &= \frac{0.0344}{0.2144}=0.1604 \end{align*}

We have shown that it is less likely that Thomas sprinkler is on, when Jake’s grass is also wet. Where Jake’s grass is extra evidence that affects the likelihood of our observed effect.


Fig. 3 – Uncertainties and unreliable evidence

In case an observed outcome holds uncertainty or we do not trust the source of our values we can account for that. Lets assume the sprinkler malfunctions sometimes, so that instead of a hard [0,1] we get a vector S=[0.2, 0.8] instead, also called soft evidence. We denote this with dashed nodes (see Fig. 2).
Lets assume Jake’s house is quite far away and there is an added unreliability in the evidence that his grass is wet — maybe Jake is a notorious liar about the wetness of his soil. We denote this with a dotted vertex (see Fig.2).
To solve and to account for uncertainties and unreliability is a whole different topic, which we will address in another post.


Some dependency statements cannot be represented structurally in this way; for example marginalized graphs.
One famous example that BNs hold errors when it comes to causal relations is the Simpson Paradoxon . This can be solved with atomic intervention, which is a topic for yet another post.


  • (Bayesian) Belief Networks represent probabilistic models as well as the factorization of distributions into conditional probabilities
  • BNs are directed acyclic graphs (DAGs)
  • We can reduce the amount of computation and space required by taking into account constraints from the model which are expressed structurally
  • conditional independence is expressed as absence of a link in a network


  1. D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press. USA. 2012: pp.29-51
  2. G. Pipa, Neuroinformatics Lecture Script. 2015: pp.19-26