or how to predict something you don’t know with confidence intervals.
Currently there is a prominent forecasting challenge happening on Kaggle, the M5 Forecasting Challenge. The data comprise the sales of individual product items over a long time range (1913 days, or a little over 5 years), together with calendar and price information. An excellent in-depth analysis by M. Henze can be found here. The goal of the challenge is to make informed predictions of how the sales of the individual items will continue over the upcoming 28 days.
One can approach this challenge with different methods: LSTMs, MCMC methods and others come to mind. This article will focus on how to generate predictions through a relatively novel approach – NGBoosting.
NGBoost was developed by the Stanford ML Group and was published on arXiv in 2019 by Tony Duan, Anand Avati and others. Unlike other boosting methods such as LightGBM or XGBoost, it performs the boosting updates with natural gradients rather than ordinary gradients. According to the authors, this reduces overfitting on known data points and reaches performance on par with the state of the art on standard benchmark datasets; see the paper by Duan et al. in the references for details and insights. Additionally, a fitted model does not only return point predictions: it yields a PDF (probability density function) for every prediction, from which uncertainty intervals can be derived.
Making predictions with boosting algorithms is not as straightforward as with LSTMs or MCMC methods. An LSTM can generate new data over specified time steps once the model has learned the underlying structure of the input data. If you fit a boosting model and simply hit the “predict 28 days” button, it would spit out the same value 28 times.
So let’s get the party started in your favorite editor or notebook:
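To make the idea of an uncertainty interval concrete, here is a minimal sketch that assumes the predictive distribution for one item on one day is Normal, which is the distribution `NGBRegressor` fits by default. It uses only Python's standard library. The numbers `loc` and `scale` are made up for illustration; in practice they would come from the fitted model.

```python
from statistics import NormalDist


def central_interval(loc, scale, coverage=0.95):
    """Return (lower, upper) of the central interval of a Normal(loc, scale)."""
    dist = NormalDist(mu=loc, sigma=scale)
    tail = (1.0 - coverage) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


# Hypothetical predicted parameters for a single item on a single day.
loc, scale = 3.2, 1.1

lower, upper = central_interval(loc, scale)
# Value of the probability density function at the predicted mean.
density_at_mean = NormalDist(mu=loc, sigma=scale).pdf(loc)

print(f"95% interval: [{lower:.2f}, {upper:.2f}]")
```

Since sales are non-negative counts, a Normal interval can dip below zero for slow-selling items, so clipping the lower bound at zero may be sensible.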
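```python
# Notebook shell command; in a terminal, run "pip install ngboost" instead.
!pip install ngboost

from ngboost import NGBRegressor
from ngboost.scores import LogScore
from sklearn.model_selection import train_test_split

df = ...    # load the dataset as provided on Kaggle
X, y = ...  # split the df into X and y as described below

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ngb_clf = NGBRegressor(n_estimators=100, Score=LogScore, minibatch_frac=0.5)
ngb_clf.fit(X_train, y_train)
ngb_clf.predict(X_test)
```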
Now that we have predicted something, we want to look into the future, one day at a time. To get adjusted predictions, we take the data we have already predicted into account: we refit the model and make the next prediction incrementally. You can find a possible algorithm for this in Fig. 3. Let’s say we have our input data in the format below (see Fig. 1). What we need in addition is an empty column for the next day (e.g. day 1914).
We then wrangle our data (you might use Python’s pandas, the R tidyverse, or maybe you like Excel) and come up with the beautiful dataframe below (see Fig. 2). It contains (1) the sales as our y-value, i.e. what we want to fit and predict, (2) one-hot encodings of the categories we are interested in, e.g. time features or price features, and (3) the label-encoded item_id (this is optional, but helps to assign the predictions later). Now we follow the pseudo-code below:
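As a dependency-free sketch of the two encodings mentioned above (with pandas one would typically reach for `pandas.get_dummies` instead), the helpers `one_hot` and `label_encode` below are hypothetical names, and the four rows are toy data made up for illustration, not taken from the competition files.

```python
def one_hot(values):
    """Map each value to a list of 0/1 indicators, one position per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return codes, [codes[v] for v in values]


# Toy rows for illustration: (item_id, weekday, sales)
rows = [
    ("FOODS_1_001", "Mon", 3),
    ("FOODS_1_002", "Mon", 0),
    ("FOODS_1_001", "Tue", 1),
    ("FOODS_1_002", "Tue", 2),
]
item_ids = [r[0] for r in rows]
weekdays = [r[1] for r in rows]
y = [r[2] for r in rows]  # the sales are the target

weekday_cats, weekday_onehot = one_hot(weekdays)
item_codes, item_labels = label_encode(item_ids)

# Feature matrix: label-encoded item_id followed by the one-hot weekday columns.
X = [[label] + onehot for label, onehot in zip(item_labels, weekday_onehot)]
```

Keeping the `item_codes` mapping around makes it possible to translate the integer codes back to item IDs when assigning predictions later.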
- We split the dataframe into the days that we know (e.g. up to day 1913+i, where i is the timestep we are working with), let’s call this X_train, and the new day that we don’t know yet, called X_pred.
- We train the regressor on the data that we have (X_train).
- We call the predict function of the fitted regressor on X_pred.
- We repeat the process until we have all 28 days.
The user guide describing what ngboost offers and how it works can be found on the Stanford ML Group’s site.
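The steps above can be sketched in plain Python as follows. This is only an illustration of the refit-and-predict loop under simplifying assumptions: `MeanPerItem` is a hypothetical stand-in model that predicts each item's mean sales so far (so its forecasts are flat), each feature row is reduced to an `(item, day)` pair, and the function name `forecast_incrementally` is made up. For real use, a regressor such as `NGBRegressor` with proper feature rows would take the stand-in's place.

```python
from collections import defaultdict


class MeanPerItem:
    """Hypothetical stand-in exposing fit/predict like a regressor.

    It predicts each item's mean of the sales seen so far.
    """

    def fit(self, X, y):
        sums, counts = defaultdict(float), defaultdict(int)
        for (item, _day), sales in zip(X, y):
            sums[item] += sales
            counts[item] += 1
        self.means_ = {item: sums[item] / counts[item] for item in sums}
        return self

    def predict(self, X):
        return [self.means_[item] for item, _day in X]


def forecast_incrementally(model, history, items, last_day, horizon):
    """Extend history, a dict mapping (item, day) -> sales, by `horizon` days."""
    known = dict(history)  # copy, so the caller's data stays untouched
    for i in range(horizon):
        next_day = last_day + i + 1
        # Days we know so far, including the predictions of earlier steps.
        X_train = sorted(known)
        y_train = [known[key] for key in X_train]
        # The day we do not know yet.
        X_pred = [(item, next_day) for item in items]
        model.fit(X_train, y_train)
        for key, value in zip(X_pred, model.predict(X_pred)):
            known[key] = value
    return known


# Toy data: two items and three known days (stand-ins for days 1 to 1913).
history = {(0, 1): 2, (0, 2): 4, (0, 3): 3, (1, 1): 0, (1, 2): 1, (1, 3): 2}
result = forecast_incrementally(
    MeanPerItem(), history, items=[0, 1], last_day=3, horizon=28
)
```

Note that the model is refitted once per forecast day, so with a real boosting model the loop costs 28 training runs.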
The complete code for our solution and a notebook will be provided through a GitHub link once the challenge has concluded in June 2020.
- Tony Duan, Anand Avati, et al. NGBoost: Natural Gradient Boosting for Probabilistic Prediction. v3, 2020. Available on arXiv.
- The NGBoost User Guide
- The M5 Kaggle Challenge