Pyro Top-Down Forecasting | Application-case

Connect the dots over time and forecast with confidence(-intervals).

Connecting dots across time – photo by israel palacio on Unsplash

Have you ever wondered how to account for uncertainties in time-series forecasts?
Have you ever thought there should be a way to generate data from previously seen data and make judgement calls about certainties? I know we have.
If you want to build models that capture probabilities and hold confidences, we recommend using a probabilistic programming framework like Pyro.
In a previous article we looked at NGBoost and applied it to the M5 forecasting challenge on Kaggle. As a quick recap: the M5 forecasting challenge asks us to predict how the sales of Walmart items will develop over time. It provides around five years of data for items of different categories from different stores across different states and asks us to forecast 28 days for which we have no information. As an overview of the challenge and dataset we still recommend this amazing notebook. Last time we concluded that we got better results with Pyro, and here is a simple walk-through of how we got there.

There are different approaches to modeling and forecasting data over time: top-down models, state-space models and hierarchical models, to name a select few.
In this article we show how a prediction can be made with a rough and rudimentary top-down model. We make thorough use of the existing Forecaster object in Pyro: we tell the model how to make an informed prediction and then dump all the data into it. That's it – simple enough! No fancy priors, no fancy assumptions about the underlying distributions, just data and a probabilistic framework. A few assumptions we still have to make, and we will walk you through them.

Sidenote: A more elegant way to do forecasting is with hierarchical models. These models allow accounting for different distributions over different categories in your data. You can assign individual priors and make use of information that you have about your model. The elegance of such an approach will be covered in another post.

In the end, we dump all the time-series information we have – sales over time – into the model, while treating each sale as an independent random event. We don't use anything else. If the Pyro framework, specifically the ForecastingModel, is the engine of our model, then we also need a driver.
The element that drives our model in the right direction is this:

prediction = regressor + trend + seasonal + motion + bias

This simple line combines Pyro objects from which we sample over time. Now, let's look at what all those parts mean:

Composing the model

The actual model part is captured by the regressor. This piece learns proper model weights and applies them to the features, which come directly from the data input. Each weight lies on a classical Gaussian curve: we sample it from a normal distribution centered around zero with a standard deviation of 1 (\mu=0, \sigma=1).

weight = pyro.sample("weight", dist.Normal(0, 1).expand(
    [feature.size(-1)]).to_event(1))
regressor = (weight * feature).sum(-1)

The second parameter is the trend. We consult our probabilistic looking-glass and decide it should come from a log-normal distribution, such that l\sim\mathcal{LN}(-2, 1), i.e. \mu=-2 and \sigma=1. A key element is that we multiply it with the time feature that we provide to the model.

trend_coef = pyro.sample("trend", dist.LogNormal(-2, 1))
trend = trend_coef * time

Now what is a trend without a season? Trends come and go, but the seasons are what we can count on. We know that the underlying data describes purchases over time, and we think that a change in sales is captured by what day of the week it is. Therefore we say:

with pyro.plate("day_of_week", 7, dim=-1):
    seasonal = pyro.sample("seasonal", dist.Normal(0, 5))
seasonal = periodic_repeat(seasonal, duration, dim=-1)

Here we make use of another neat Pyro construct: the plate. It is used to model independence of events, where by event we mean an observation of an instance occurring at one of the time steps of our data. A thorough explanation can be found in the Pyro SVI tutorials. For our purposes we break it down and say that we expect the data to rise and fall within a seven-day window, which we translate into seven independently sampled values. Last but not least, there is not just one seven-day interval: we periodically repeat it over the whole time input, i.e. our duration is the size of our time tensor.
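To get a feeling for what periodic_repeat does, here is a tiny, self-contained sketch with toy values (the numbers are made up and are not the actual model tensors):

import torch
from pyro.ops.tensor_utils import periodic_repeat

# a toy "weekly" pattern of 7 values, one per day of the week
weekly = torch.arange(7.0)

# tile the pattern periodically until it covers a longer duration, e.g. 17 time steps
duration = 17
repeated = periodic_repeat(weekly, duration, dim=-1)

print(repeated)  # the 7-day pattern 0..6 repeated and truncated to 17 entries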

Lastly, every professional ML model needs some trainable parameter that accounts for the fact that sometimes things are just not as nice as you expect them to be – we call this a bias. We decide that our bias b comes from a normal distribution, such that b\sim\mathcal{N}(0, 10^2), i.e. \mu=0 and \sigma=10, or loc=0 and scale=10 if you want to talk Pyro.

bias = pyro.sample("bias", dist.Normal(0, 10))

Let's take a look at the distributions of our parameters before we continue.

Fig. 1 – Density plots of all sampled parameters, by color. The y-axis is clipped; the maximum of the log-normal trend density is above 2.5.
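If you want to reproduce a plot like Fig. 1 yourself, one possible way (our own sketch, not the original plotting code) is to sample from the four prior distributions and plot kernel density estimates:

import pyro.distributions as dist
import matplotlib.pyplot as plt
import seaborn as sns

n = 10_000
priors = {
    "weight": dist.Normal(0.0, 1.0),
    "trend": dist.LogNormal(-2.0, 1.0),
    "seasonal": dist.Normal(0.0, 5.0),
    "bias": dist.Normal(0.0, 10.0),
}

for name, d in priors.items():
    sns.kdeplot(d.sample((n,)).numpy(), label=name)

plt.ylim(0, 2.5)  # clip the y-axis as in Fig. 1; the LogNormal density peaks above 2.5
plt.legend()
plt.show()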

You, the attentive reader, might conclude at this point that we have everything we need – and we agree. What we have now is sufficient to give us a good estimate and to make projections a couple of days into the future.
We can do better though. The prediction so far is, all by itself, a really stiff expression: the model is told to fit the input, but we don't allow for any leeway.
Let's assume, for the sake of argument, that all people in California decide they would rather go shopping on a Thursday afternoon instead of a Saturday; or a crisis occurs and everybody goes to prepare for the apocalypse. Our model might have correctly predicted all previous Saturdays, but that one Thursday really throws it off. To account for this we introduce something called reparameterization. For an extensive overview of the topic see [2].
To make the model even more robust we stack one reparameterization on top of another – why this helps is explained in the Pyro forecasting tutorials (see [1]).
Packing everything together we get a drift, which defines our motion – the motion term in our prediction above. This translates to Pyro as follows:

drift_stability = pyro.sample("drift_stability", dist.Uniform(1, 2))
drift_scale = pyro.sample("drift_scale", dist.LogNormal(-17, 5))
with self.time_plate:
    with poutine.reparam(config={"drift": LocScaleReparam()}):
        with poutine.reparam(config={"drift": SymmetricStableReparam()}):
            drift = pyro.sample("drift", dist.Stable(drift_stability, 0, drift_scale))
motion = drift.cumsum(dim=-1)

Now we have all we need for our model to work.

Your model at work

For the actual model fitting we use Pyro's Stochastic Variational Inference engine, or SVI. This algorithm lets us approximate our posterior distribution in a reasonable amount of time. It does so by maximizing the Evidence Lower Bound, or ELBO. One simplified way to think about the ELBO is to compare two distributions: the approximation we are currently fitting and the distribution suggested by the observed data. If the two are wildly different we get a low ELBO value; if the fitted distribution aligns well with the observed one, we get a high ELBO value. Some might even say that the divergence between the two distributions is minimized. To read about it in detail we refer to the original SVI paper (see [3]). For now let's assume this just works, because smart researchers have implemented it well.

We will go into the interplay of Stochastic Variational Inference and ELBO (or minimizing KL) in another article. 
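The Forecaster wrapper hides these details, but for orientation, a bare-bones SVI loop in plain Pyro looks roughly like the following sketch – the toy model, guide and data here are purely illustrative and are not part of the forecasting code:

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

def toy_model(data):
    # one latent location with a wide prior, observed through Gaussian noise
    loc = pyro.sample("loc", dist.Normal(0.0, 10.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(loc, 1.0), obs=data)

guide = AutoNormal(toy_model)
svi = SVI(toy_model, guide, Adam({"lr": 0.05}), loss=Trace_ELBO())

data = torch.randn(100) + 3.0
for step in range(1000):
    loss = svi.step(data)  # one gradient step on the negative ELBO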

Now that we have put together a working model and selected the algorithm that computes our distributions, we have to set the hyperparameters, load the data and start training. For data handling we rely extensively on the M5 starter kit implementation on GitHub, which the Pyro PPL team generously provides. It is an easy way to get aggregates over e.g. the sales data. Feel free to implement this yourself – it's a sure way to learn your way around tensors. For our beginner model we only care about the sales data over time.
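If you would rather build the aggregate yourself, a rough sketch could look like the following – it assumes the standard sales_train_validation.csv from the competition, and the log transform mirrors the aggregated log values shown later in Fig. 2:

import pandas as pd
import torch

# the file name is the one from the Kaggle competition; adjust the path as needed
sales = pd.read_csv("sales_train_validation.csv")

# keep only the per-day columns (d_1 ... d_1913) and sum the sales over all items
day_cols = [c for c in sales.columns if c.startswith("d_")]
total_sales = sales[day_cols].sum(axis=0).to_numpy()

# log-transform and reshape to (duration, 1), since the model expects univariate data
data = torch.tensor(total_sales, dtype=torch.float).log().unsqueeze(-1)
print(data.shape)  # torch.Size([1913, 1])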
We now set our parameters and instantiate the forecaster as follows:

forecaster_opt = {
    "learning_rate": 0.1,
    "learning_rate_decay": 0.1,
    "clip_norm": 10,
    "num_steps": 3501,
    "log_every": 100,
}
forecaster = Forecaster(TopDownModel(), data, covariates[:-28], **forecaster_opt)
samples = forecaster(data, covariates, num_samples=1000).exp().squeeze(-1).cpu()
pred = samples.mean(0)

In the above code we fit the forecaster, which under the hood uses a DCTAdam optimizer. We provide the covariates for training but withhold the last 28 days, which we then forecast in the line below. The actual prediction is the mean across the samples, which will be used for our Kaggle submission – but let's not get ahead of ourselves. When we run the above code and have implemented our model correctly, we should see the following output of the actual training:

Forecaster training process with the specified model over all the training data for 3501 steps. The loss is minimized over time.

We see that our loss decreases. This is good and tells us that the distributions our model computes get better over time. The loss here is the negative ELBO divided by the amount of data, as described in the Forecaster documentation (see [4]).

The Real Uncertain Thing – with confidence and everything

We now hold a final model in our memory. We can compute losses with it, throw it against new data, poke it and see what it generates.
For example, we can sample sales over time as you can see in the image below:

Fig. 2 – Real data and data sampled from the fitted model over time. The last 14 available days of ground truth are depicted by the red line, while the sampled predictions of the model are light green. We then forecast the required 28 days; their mean (dark green) serves as the prediction. A 10%–90% confidence interval for each time step is shown as red shading.

To compute the confidence of the individual steps we ask our model for 100 samples and compute quantiles (0.1 and 0.9) over those samples. This is the red area you can see in Fig. 2. The sales in the figure are aggregated log values, which helps the model fit; to get values in a more realistic range, transform them back using numpy's exp.
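A minimal sketch of how such an interval can be computed – the variable names match the forecaster snippet above, and the quantile levels are the 10% and 90% bounds from the figure:

# draw a fresh batch of forecasts from the fitted model
samples = forecaster(data, covariates, num_samples=100)
samples = samples.exp().squeeze(-1).cpu()  # undo the log of the aggregated sales

point_forecast = samples.mean(0)       # dark-green line in Fig. 2
lower = samples.quantile(0.1, dim=0)   # lower bound of the red band
upper = samples.quantile(0.9, dim=0)   # upper bound of the red band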

To Summarize

We have shown how you can build a model: use the facts you know about the problem to set the model parameters and find appropriate distributions. A lot of functionality comes with Pyro, such as SVI, the algorithm that approximates the posterior and makes the model training possible in the first place. After we have fitted our model we can sample from it, forecast data points and compute confidences for our predictions.

Complete Model Code

import pyro
import pyro.distributions as dist
import pyro.poutine as poutine
from pyro.contrib.forecast import ForecastingModel
from pyro.infer.reparam import LocScaleReparam, StableReparam, SymmetricStableReparam
from pyro.ops.tensor_utils import periodic_repeat


class TopDownModel(ForecastingModel):
  """
  Top-Down Hierarchical Forecasting Model
  """
  def model(self, zero_data, covariates):
    # check univariate data
    assert zero_data.size(-1) == 1 
    duration = zero_data.size(-2)

    time, feature = covariates[..., 0], covariates[..., 1:]

    bias = pyro.sample("bias", dist.Normal(0, 10))
    trend_coef = pyro.sample("trend", dist.LogNormal(-2, 1))
    trend = trend_coef * time

    weight = pyro.sample("weight", dist.Normal(0, 1).expand(
        [feature.size(-1)]).to_event(1))
    regressor = (weight * feature).sum(-1)

    # weekly seasonality as independent events
    with pyro.plate("day_of_week", 7, dim=-1):
      seasonal = pyro.sample("seasonal", dist.Normal(0, 5))
    seasonal = periodic_repeat(seasonal, duration, dim=-1)

    drift_stability = pyro.sample("drift_stability", dist.Uniform(1, 2))
    drift_scale = pyro.sample("drift_scale", dist.LogNormal(-17, 5))

    # introduce drift
    with self.time_plate:
      # We combine two different reparameterizers: the inner SymmetricStableReparam
      # is needed for the Stable site, and the outer LocScaleReparam improves inference.
      with poutine.reparam(config={"drift": LocScaleReparam()}):
        with poutine.reparam(config={"drift": SymmetricStableReparam()}):
          drift = pyro.sample("drift",
                              dist.Stable(drift_stability, 0, drift_scale))
    motion = drift.cumsum(dim=-1)

    # predict
    prediction = regressor + trend + seasonal + motion + bias
    # Pyro Forecast is multivariate - univariate timeseries is needed
    prediction = prediction.unsqueeze(-1)

    # heavy-tailed noise to account for outliers
    stability = pyro.sample("noise_stability", dist.Uniform(1, 2).expand([1]).to_event(1))
    skew = pyro.sample("noise_skew", dist.Uniform(-1, 1).expand([1]).to_event(1))
    scale = pyro.sample("noise_scale", dist.LogNormal(-5, 5).expand([1]).to_event(1))
    noise_dist = dist.Stable(stability, skew, scale)
    with poutine.reparam(config={"residual": StableReparam()}):
      self.predict(noise_dist, prediction)

Disclaimer

A big thanks to the Pyro documentation and the development team. The comprehensive writing on the topics and the implementations on GitHub really enable Pyro users to hit the ground running. Our model is a wild experimental mix of different available models, and we tested which parts work well together.


References

  1. Pyro Documentation – Tutorials on Forecasting I, II and III
  2. M. Gorinova, D. Moore, M. Hoffman. Automatic Reparameterisation of Probabilistic Programs. 2019. Published on arXiv.
  3. M. Hoffman et al. Stochastic Variational Inference. 2013. In JMLR.
  4. Uber Technologies. Pyro Forecaster. 2018. See documentation.

NGBoost incremental Forecasting

or how to predict something you don’t know with confidence-intervals.

Currently there is a prominent forecasting challenge happening on Kaggle, the M5 Forecasting Challenge. Given are the sales of product items over a long time range (1913 days, or around 5 years), calendar information and price info. An excellent in-depth analysis by M. Henze can be found here. The goal of the challenge is to make informed predictions of how the sales of the individual items will continue over the upcoming 28 days.
One can approach this challenge with different methods: LSTMs, MCMC methods and others come to mind. This article will focus on how to generate predictions with a relatively novel approach – NGBoost.

NGBoost was developed by the Stanford ML Group and was published by Tony Duan, Anand Avati and others on arXiv in 2019. The approach optimizes boosting by utilizing natural gradients, in contrast to other boosting methods like LightGBM or XGBoost. It reduces overfitting on known data points and reaches at least state-of-the-art performance on standard benchmark datasets – see [1] for reference and insights. Additionally, after fitting a model we can generate PDFs (probability density functions) over the data (see below) and even obtain uncertainty intervals.

Making predictions with boosting algorithms is not as straightforward as with LSTMs or MCMC methods. An LSTM can generate new data over specified time steps once it has learned the underlying structure of the input data. If you fit a boosting model and simply hit the "predict 28 days" button, it would spit out the same value 28 times.
So let's get the party started in your favorite editor or notebook:

!pip install ngboost
from ngboost import NGBRegressor
from ngboost.scores import LogScore
from sklearn.model_selection import train_test_split

df = ...         # load the dataset as provided on kaggle
X, y = ..., ...  # split the df into X and y as described below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ngb_clf = NGBRegressor(n_estimators=100, Score=LogScore, minibatch_frac=0.5)
ngb_clf.fit(X_train, y_train)

ngb_clf.predict(X_test)
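The fitted regressor can also return full predictive distributions via pred_dist – this is where the PDFs and uncertainty intervals mentioned above come from. A small sketch; the interval computation via scipy is our own addition and assumes the default Normal distribution:

from scipy.stats import norm

# predictive distributions for the test rows (one Normal per prediction by default)
y_dists = ngb_clf.pred_dist(X_test)
loc, scale = y_dists.params["loc"], y_dists.params["scale"]

# an 80% uncertainty interval per prediction, computed from the Normal parameters
lower, upper = norm.interval(0.8, loc=loc, scale=scale)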

Now that we have predicted something, we want to look into the future – one look at a time. The approach to get adjusted predictions is to take previously predicted data into account: we refit the model and make the next prediction incrementally. You can find a possible algorithm for this in Fig. 3, and a code sketch follows after the list below. Let's say we have our input data in the format below (see Fig. 1). What we need in addition is an empty column for the next day (e.g. day 1914).

Fig. 1 – the original sales dataframe: one row per item with 6 item features (id, item_id, ..., state_id) and 1913 day columns. In its original size the dataset has 30490 items.

We then wrangle our data (you might use Python's pandas, the R tidyverse, or maybe you like Excel) and come up with the beautiful dataframe below (see Fig. 2). That means we have (1) the sales as our y-value, i.e. what we want to fit and predict, (2) one-hot encoded the categories that we are interested in, e.g. time features or price features, and (3) label-encoded the item_id (this is optional, but helps to assign the predictions later). Now we follow the pseudo-code below:

Fig. 2 – Shape of the input data: categories are one-hot encoded and the days are depicted in the d-column.
  • We split the dataframe into the days that we know (e.g. 1913+i, where i is the time step we are working on) – let's call this X_train – and the new day that we don't know, called X_pred.
  • We train the regressor on the data that we have -> X_train.
  • We call the predict function of the fitted regressor on X_pred.
  • We repeat the process until we have all 28 days.
Fig. 3 – a possible approach to incrementally generate predictions
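A rough, hedged sketch of this loop – the numeric "d" column and the way predictions are written back into y are our own placeholder choices, not the exact competition code:

from ngboost import NGBRegressor
from ngboost.scores import LogScore

LAST_KNOWN_DAY = 1913  # last day with observed sales

for i in range(1, 29):
    next_day = LAST_KNOWN_DAY + i

    # rows up to the current day are known (or already predicted), the next day is not
    known = X["d"] < next_day
    X_train, y_train = X[known], y[known]
    X_pred = X[X["d"] == next_day]

    model = NGBRegressor(n_estimators=100, Score=LogScore, minibatch_frac=0.5)
    model.fit(X_train, y_train)

    # write the predictions back so the next iteration can train on them as well
    y.loc[X_pred.index] = model.predict(X_pred)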

The user guide to what NGBoost offers and how it works can be found on the Stanford ML Group's site.
The complete code of our solution and a notebook will be provided through a GitHub link once the challenge concludes in June 2020.

Happy Coding!

References

  1. Tony Duan, Anand Avati, et al. NGBoost: Natural Gradient Boosting for Probabilistic Prediction. v3, 2020. Available on arXiv.
  2. The NGBoost User Guide
  3. The M5 Kaggle Challenge