NGBoost Incremental Forecasting

or how to predict something you don’t know, with confidence intervals.

Currently there is a prominent forecasting challenge happening on Kaggle, the M5 Forecasting Challenge. Given are the sales of product items over a long time range (1913 days, or around 5 years), calendar information and price information. An excellent in-depth analysis by M. Henze can be found here. The goal of the challenge is to make informed predictions on how the sales of the individual items will continue over the upcoming 28 days.
One can approach this challenge with different methods: LSTMs, MCMC methods and others come to mind. This article focuses on how to generate predictions through a relatively novel approach – NGBoosting.

NGBoost was developed by the Stanford ML Group and was published by Tony Duan, Anand Avati and others on arXiv in 2019. In contrast to other boosting methods like LightGBM or XGBoost, this approach optimizes boosting by utilizing natural gradients. It reduces overfitting on known data points and reaches at least state-of-the-art performance on standard benchmarking datasets – see [1] for reference and insights. Additionally, after fitting a model, we can generate PDFs (probability density functions) over the data (see below), and we can even get uncertainty intervals.

Making predictions with boosting algorithms is not as straightforward as with LSTMs or MCMC methods. An LSTM can generate new data over specified time steps after the model has learned the underlying structure of the input data. If you fit a boosting model and simply hit the “predict 28 days” button, it would spit out the same value 28 times.
So let’s get the party started in your favorite editor or notebook:

!pip install ngboost
from ngboost import NGBRegressor
from ngboost.scores import LogScore
from sklearn.model_selection import train_test_split

df = ...  # load the dataset as provided on Kaggle
X, y = ...  # split the df into X and y as described below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ngb_clf = NGBRegressor(n_estimators=100, Score=LogScore, minibatch_frac=0.5)
ngb_clf.fit(X_train, y_train)


Now that we can predict something, we want to look into the future – one look at a time. The approach to get adjusted predictions is to take predicted data into account. That means we refit the model and make the next prediction incrementally. You can find a possible algorithm for this in Fig. 3. Let’s say we have our input data in the format below (see Fig. 1). What we need in addition is an empty column for our next day (e.g. day 1914).

Fig. 1 – sales dataframe original – we have a row per item with 6 item-features (id, item_id, .., state_id) and 1913 days. In its original size we have 30490 items.

We then wrangle our data (you might use Python’s pandas, the R tidyverse, or maybe you like Excel) and come up with the beautiful dataframe below (see Fig. 2). That means we have: (1.) the sales as our y-value, i.e. what we want to fit and predict, (2.) one-hot encoded the categories that we are interested in, e.g. time features or price features, and (3.) label-encoded the item_id (this is optional, but helps to assign the predictions later). Now we follow the pseudo-code below:

Fig. 2 – Shape of the input data: categories are one-hot encoded and days are depicted as a d-column.
  • We split the dataframe into the days that we know (e.g. up to day 1913+i, where i is the timestep we are working with) – let’s call this X_train – and the new data that we don’t know, called X_pred.
  • We train the regressor on the data that we have -> X_train.
  • We call the predict function of the fitted regressor on X_pred.
  • We append the prediction to the known data and repeat the process until we have all 28 days.
Fig. 3 – possible approach to incrementally generate predictions

The user guide to what ngboost offers and how it works can be found on the Stanford ML Group’s site.
The complete code of our solution and a notebook will be provided through a GitHub link once the challenge concludes in June 2020.

Happy Coding!


  1. Tony Duan, Anand Avati, et al. NGBoost: Natural Gradient Boosting for Probabilistic Prediction. v3. 2020. Available on arXiv.
  2. The NGBoost User Guide
  3. The M5 Kaggle Challenge

Belief Networks

or how Bayesian Belief Networks combine probability and graph theory to represent probabilistic models in a structured way.

Read: 12 min

  • translate graphical models to do inference
  • read and interpret graphical models
  • identify dependencies and causal links
  • work through an example


Belief Networks (BNs) combine probabilities with a graphical representation to show causal relationships and assumptions about the parameters – e.g. independence between parameters. This is done through nodes and directed links (edges) between the nodes, forming a Directed Acyclic Graph (DAG). This also allows us to display operations like conditioning on parameters or marginalizing over parameters.

A DAG is a graph with directed edges
that does not contain any cycles.
Fig 1. A basic representation of a Belief Network as a graph.

Use and Properties

In Fig. 1 we can see a model with four parameters, let’s say rain (R), Jake (J), Thomas (T) and sprinkler (S). We observe that R has an effect on our parameters T and J. We can make a distinction between cause and effect – e.g. the rain is causal to Jake being wet.
Further, we can model or observe constraints that our model holds, such that we do not have to account for all combinations of possible parameters (2^N) and can reduce the space and computational costs.
This graph can represent different independence assumptions through the directions of the edges (arrows).

Conditional Independence

We know that we can always express a joint probability via the chain rule as p(x_1,x_2,x_3)=p(x_{i_1} | x_{i_2},x_{i_3})p(x_{i_2}|x_{i_3})p(x_{i_3}) for any ordering (i_1,i_2,i_3) of the variables. This gives six possible permutations, so simply drawing one graph per factorization does not work.

Remember there should not be any cycles in the graph. Therefore we have to drop at least two edges, such that we get 4 possible DAGs from the 6 permutations.
Now, when two edges point to one and the same node, we get a collision:

Fig. 2 a) Collisions and induced dependence
Fig. 2b) Model conditioned on T

In case a collision occurs we get either a d-separation or a d-connection. For example (Fig. 2) we see that X and Y are independent of each other and are ’causes’ of the effect Z. We can also write p(X,Y,Z)=p(Z|X, Y)p(X)p(Y).
However, if we condition the model on the collision node (see Fig. 2b), we get p(X,Y|T)\neq p(X|T)p(Y|T), and X and Y are not independent anymore.
A connection is introduced in the right model of Fig. 2a, where X and Y are now conditionally dependent given Z.

Belief Networks encode conditional independence well; they do not encode dependence well.

Definition: Two nodes \mathcal{X} and \mathcal{Y} are d-separated by \mathcal{Z} in a graph G if and only if they are not d-connected.
Pretty straightforward, right?

A more applicable explanation is: for every variable x in \mathcal{X} and y in \mathcal{Y}, trace every path U between x and y; if all paths are blocked, then the two nodes are d-separated. For the definitions of blocking and of descendants of nodes you can find more in this paper.

Remember: X is independent of Y if p(X,Y)=p(X)p(Y).

Now away from the theory to something more practical:

How to interact with a model

  1. Analyze the problem and/or set of parameters
    1. set a graphical structure for the parameters in a problem with a DAG
    2. OR reconstruct the problem setting from the parameters
  2. Compile a table of all required conditional probabilities – p(x_i|pa(x_i)), where pa(x_i) denotes the parents of x_i
  3. Set and specify parental relations


Recap Fig. 1 – our short example case

Thomas and Jake are neighbors. When Thomas gets out of the house in the morning he observes that the grass is wet (effect). We want to know how likely it is that Thomas forgot to turn off the sprinkler (S). Knowing that it had rained the day before explains away the observation that the grass is wet in the morning. So let us look at the probabilities:

We have 4 boolean (yes/no) variables (see Fig. 1):
R (rain [0,1]), S (sprinkler [0,1]), T (Thomas’s grass [0,1]), J (Jake’s grass [0,1]). So we can express that Thomas’s grass is wet, given that Jake’s grass is wet, the sprinkler was off and it rained, as p(T=1 | J=1, S=0, R=1).
Even in such a small example we have a lot of possible states, namely 2^N=2^4=16 values. We already know that our model has constraints, e.g. the sprinkler is independent of the rain, and when it rains both Thomas’s and Jake’s grass gets wet.

After taking into account our constraints we can factorize our joint probability into
p(T, J, R, S) = p(T|R,S)p(J|R)p(R)p(S) – we say: “The joint probability is computed by the probability that Thomas’s grass is wet, given rain and sprinkler multiplied with the probability that Jake’s grass is wet, given that it rained and the probability that it rained and the probability that the sprinkler was on.” We have now reduced our problem to 4+2+1+1=8 values. Neat!

We use factorization of joint probabilities to reduce the total number of required states and their (conditional) probabilities.
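The counting above can be written out explicitly (following the article’s convention of counting all table entries for boolean variables):

```python
# Full joint over 4 boolean variables vs. the factorized form from the text
full_joint = 2 ** 4            # p(T, J, R, S) stored naively: 16 values
factorized = 4 + 2 + 1 + 1     # p(T|R,S) + p(J|R) + p(R) + p(S): 8 values
```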

p(J=1 | R=1) = 1.0
p(J=1 | R=0) = 0.2 *
p(T=1 | R=1, S=0) = 1.0
p(T=1 | R=1, S=1) = 1.0
p(T=1 | R=0, S=1) = 0.9
p(T=1 | R=0, S=0) = 0.0
p(R=1) = 0.2
p(S=1) = 0.1
Our conditional probability table (CPT), with the priors used in the computation below
* sometimes Jake’s grass is simply wet – we don’t know why…

We are still interested in whether Thomas left the sprinkler on. Therefore we compute the posterior probability p(S=1|T=1) = 0.3382 (see below).

    \begin{align*} p(S=1|T=1) &= \frac{p(S=1,T=1)}{p(T=1)}\\ &=\frac{\sum_{J,R}p(T=1, J, R, S=1)}{\sum_{J,R,S}p(T=1, J, R, S)}\\ &=\frac{\sum_R p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.9*0.8*0.1+1*0.2*0.1}{0.9*0.8*0.1+1*0.2*0.1+0+1*0.2*0.9}\\ &= 0.3382 \end{align*}

When we compute the posterior given that Jake’s grass is also wet we get
p(S=1|T=1, J=1) = 0.1604

    \begin{align*} p(S=1|T=1, J=1) &= \frac{p(S=1,T=1, J=1)}{p(T=1, J=1)}\\ &=\frac{\sum_R p(J=1|R)p(T=1|R,S=1)p(R)p(S=1)}{\sum_{R,S}p(J=1|R)p(T=1|R,S)p(R)p(S)}\\ &=\frac{0.2*0.9*0.8*0.1+1*1*0.2*0.1}{0.2*0.9*0.8*0.1+1*1*0.2*0.1+0+1*1*0.2*0.9}\\ &= \frac{0.0344}{0.2144}=0.1604 \end{align*}

We have shown that it is less likely that Thomas’s sprinkler is on when Jake’s grass is also wet: Jake’s grass is extra evidence that affects the likelihood of our observed effect.
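Both posteriors can be checked by brute-force enumeration over the network. The dictionaries below simply transcribe the CPT and the priors p(R=1)=0.2 and p(S=1)=0.1 used in the calculations:

```python
from itertools import product

# CPT and priors transcribed from the text (R: rain, S: sprinkler,
# T: Thomas's grass, J: Jake's grass)
p_R = {1: 0.2, 0: 0.8}
p_S = {1: 0.1, 0: 0.9}
p_J = {1: 1.0, 0: 0.2}                                # p(J=1 | R)
p_T = {(1, 0): 1.0, (1, 1): 1.0, (0, 1): 0.9, (0, 0): 0.0}  # p(T=1 | R, S)

def joint(t, j, r, s):
    """p(T=t, J=j, R=r, S=s) = p(T|R,S) p(J|R) p(R) p(S)."""
    pt = p_T[(r, s)] if t == 1 else 1 - p_T[(r, s)]
    pj = p_J[r] if j == 1 else 1 - p_J[r]
    return pt * pj * p_R[r] * p_S[s]

# p(S=1 | T=1): sum out J and R
num = sum(joint(1, j, r, 1) for j, r in product((0, 1), (0, 1)))
den = sum(joint(1, j, r, s) for j, r, s in product((0, 1), (0, 1), (0, 1)))
print(round(num / den, 4))    # 0.3382

# p(S=1 | T=1, J=1): sum out R only
num2 = sum(joint(1, 1, r, 1) for r in (0, 1))
den2 = sum(joint(1, 1, r, s) for r, s in product((0, 1), (0, 1)))
print(round(num2 / den2, 4))  # 0.1604
```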


Fig. 3 – Uncertainties and unreliable evidence

In case an observed outcome holds uncertainty, or we do not trust the source of our values, we can account for that. Let’s assume the sprinkler malfunctions sometimes, so that instead of a hard [0,1] we get a vector S=[0.2, 0.8], also called soft evidence. We denote this with dashed nodes (see Fig. 3).
Let’s assume Jake’s house is quite far away and there is an added unreliability in the evidence that his grass is wet – maybe Jake is a notorious liar about the wetness of his soil. We denote this with a dotted edge (see Fig. 3).
To solve and to account for uncertainties and unreliability is a whole different topic, which we will address in another post.


Some dependency statements cannot be represented structurally in this way, for example marginalized graphs.
One famous example showing that BNs can mislead when it comes to causal relations is Simpson’s paradox. This can be solved with atomic interventions, which is a topic for yet another post.


  • (Bayesian) Belief Networks represent probabilistic models as well as the factorization of distributions into conditional probabilities
  • BNs are directed acyclic graphs (DAGs)
  • We can reduce the amount of computation and space required by taking into account constraints from the model which are expressed structurally
  • conditional independence is expressed as the absence of a link in the network


  1. D. Barber, Bayesian Reasoning and Machine Learning. Cambridge University Press. USA. 2012: pp.29-51
  2. G. Pipa, Neuroinformatics Lecture Script. 2015: pp.19-26

