## Improve Your Model with Missing Data | Imputation with NumPyro

Make the best of missing data the Bayesian way. Improve model performance and make comparative benchmarks using Monte Carlo methods.

## The Ugly Data

How do you handle missing data, gaps in your data-frames or noisy parameters?
You have spent hours at work, in the lab or in the wild to generate or curate a dataset given an interesting research question or hypothesis. Terribly enough, you find that some of the measurements for a parameter are missing!
Another case that might throw you off is unexpected noise that was introduced at some point in the experiment and has doomed some of your measurements to be extreme outliers.
Sometimes you cannot just exile a whole parameter from your analysis, simply because it has 25% missing entries or maybe entails a “does not apply” field in your questionnaire.

Fear not, there is a way you can make the most of your dataset anyways.
In this post we will use Probabilisitc Programming [5] to fill the gaps that your model might face.

## Simulate Missing Data

Lets build a problem from scratch. The studious reader is most likely familiar with the Iris flower data-set. It is somewhat part of the “hello-world” in the ML community; next to the well-known MNIST that is.
We use sklearn.datasets to get the data and put it into a pandas dataframe like so.:

iris_data = datasets.load_iris()
iris_df = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']],
columns=iris_data['feature_names'] + ['target'])

To make this example more explicit lets select for only 2 out of the given 3 species.:

iris_df["target"] = iris_df.target.astype("int")
iris_df = iris_df[iris_df.target.isin([0, 1])]

Now you should see 100 measurements (rows that are our n) of the Iris flower data.
We now take out 50% of the petal length entries by chance and assign NaN values with a bit of numpy and pandas magic.

random_vec = np_random(iris_df["petal length (cm)"].shape)<0.5 # 50% marked by numpy random vector on the column
iris_df["petal length (cm)"] = iris_df["petal length (cm)"].where(random_vec, other=np.nan)

## Why do we care about all of our measurements?

When we plot our data, we can clearly see that the petal length and petal width can help us identify which species (target-column) we have (see Fig. 1).

We can make an even clearer distinction by regressing on the groups using just the vector which we have decimated earlier.

Too bad about half of the data is missing. So what do we do now?

First off, there are different degrees of missing data:

Missing Completely at Random (MCAR) and Missing at Random (MAR).

For a formal introduction to this see Bayesian Data Analysis [1] Ch. 18

One important distinction is that MCAR treats your missing parameters as independent of your observed measurements (lets call them y).
In our example we rely on our parameter being missing at random or MAR. That means we don’t think that the chance of a value gone missing is dependent on the parameter property BUT we think that there is potentially some relation between our length measure and the target species.

## A Regression Model to Rule them all

First lets extract some basic info from the data of what we have so far:

>>> print("Petal Length μ = {}".format(iris_df["petal length (cm)"].mean()))
>>> Petal Length μ = 2.8072727272727267
>>> print("Petal Length σ^2 = {}".format(iris_df["petal length (cm)"].std()))
>>> Petal Length σ^2 = 1.4923079993698234

Now we build the model to do the imputation. For this we use the tool NumPyro, which is light-weight probabilistic numpy backend to Pyro – a heavy duty probabilistic language.
This comes in handy after formulating the model. We will use MCMC methods to fit the model and simulate values for missing entries.
To make this more applicable to a use case we combine this with a Bayesian regression-model. That means we want to model the distinction in species given all parameters that we know. Our parameters therefore are:

sepal length := s_len, sepal width := s_width, petal width := p_width and, lastly petal length := p_len, which includes our missing data.

For each of those parameters we introduce a learnable bias which comes from a regular Normal distribution aka Gaussian for the sake of simplicity. So lets get started.:

def model(s_len, s_width, p_len, p_width, target=None):
b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))
...

One can do a bit of tweaking for the bias parameters above, but for our initial runs the normal Gaussian worked reasonably well.
Now the delicate part. We introduce the distribution which we assume lies under our half-destroyed parameter. As with a lot of things we assume this to be Gaussian, which in our case is a reasonable assumption. Therefore we take the get loc=\mu (2.8 +/- 0.2) and scale=\sigma^2 (1.5+/-0.2) which we had computed earlier:

...
len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
...

The informed reader clearly sees how the \mu and \sigma themselves are probabilistic parameters for themselves which we model in the process. For both we assume Gaussians but more rigorous distributions could be chosen.
Now we have to find our targets which we model. Hence we localize our missing positions in the vector with numpy.

...
len_is_nan = np.isnan(p_len)
len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]

if target is not None:
len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
else:
len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx],
len_sigma[len_nan_idx]))
...

Additionally, we tell NumPyro to make the imputation-value (len_impute) a learnable parameter and fill with as many zeroes as values we are missing.
On a short sidenote: our model has a predictive mode, that occurs when no target (aka species) is provided. When we invoke the predictive capabilities, we sample from the introduced distribution with the sampled \mu and sampled \sigma instead. Make sure to only apply this on the NaN positions in your vector! That is the meaning of the else clause:

numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], len_sigma[len_nan_idx])

We then put the imputed values into the right places and give it the inference sampling, with the observed length values to do its optimization. For this we use the JAX library with some efficient index updating – linear-algebra style.:

...
p_len = ops.index_update(p_len, len_nan_idx, len_impute)

numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)
...

In our length-sampling we include the actual length measures as the observations in the process .

## Putting Things Together | Imputed Values into the Linear Model

For this model to be a proper linear model, we have to fit our target y (the species).
Since we have two classes, we do this by invoking a Bernoulli distribution.
We therefore ask in a Bernoulli process whether we are looking at y=0 or y=1 – two classes, two potential outcomes.

### The Bayesian Regression – Code

We introduce two parameters, one of which is a bias term for the length parameter. We compose the

...
l = numpyro.sample("l", dist.Normal(0, 1))
b_len = numpyro.sample("b_length", dist.Normal(0, 1))
logits = l + b_len * p_len

logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
...

Now to the magical part. We use the logit for the Bernoulli, which composes of all the parameters that we put into the model in interaction with their biases. The observed value in this process is now the target-parameter, our y.

   numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

### The complete NumPyro imputation model

def model(s_len, s_width, p_len, p_width, target=None):
b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))

# impute length
len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
len_is_nan = np.isnan(p_len)
len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]

if target is not None:
len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
else:
len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx],
len_sigma[len_nan_idx]))
p_len = ops.index_update(p_len, len_nan_idx, len_impute)

numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)

l = numpyro.sample("l", dist.Normal(0, 1))
b_len = numpyro.sample("b_length", dist.Normal(0, 1))
logits = l + b_len * p_len

logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
if target is None:
# prediction case
probs = expit(logits)
numpyro.sample("probs", dist.Delta(probs))

numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

## Run It

To fit the model we use Bayesian Inference utilizing a MCMC method; the NUTS sampler to be specific. The beauty of the NUTS sampler is that it is a hands-off approach, as it adjusts step sizes by itself, given its stepping through the problem-space. We let it have 5000 warm-up or burn-in runs, which are discarded in the process and let it run for 10000 iterations, just to be sure. Our inference finishes with an acceptance probability of 0.96 or 96% – neat!. We then have a look at the summary what has happened:

mcmc = MCMC(NUTS(model=model), 5000, 10000, num_chains=1)
mcmc.run(random.PRNGKey(42), **iris_data)

mcmc.print_summary()

The output clearly shows convergence of the parameters which we sampled in our inference. Specifically the R-hat values are all >= 1.0 .

We then pull the samples and visualize the posterior – we rely on JAX again for the formatting magic:

samples = mcmc.get_samples()
posterior_mu = jnp.expand_dims(samples['l'], -1) + \
jnp.expand_dims(samples['b_length'], -1)*iris_df["petal length (cm)"].values + \
jnp.expand_dims(samples['b_s_len'], -1)*iris_df["sepal length (cm)"].values + \
jnp.expand_dims(samples['b_s_width'], -1)*iris_df["sepal width (cm)"].values + \
jnp.expand_dims(samples['b_p_width'], -1)*iris_df["petal width (cm)"].values

There we have it. A clear distinction between our two classes. To be fair there is quite a bit of jitter going on above and below the bounds, but the regression-line could not be nicer.

We have shown that we can fill in the blanks in our model through MCMC-sampling.

For the code-savvy curious reader the final code can be found on my github and in Colab.

## What if there are more than two classes, What about Categorical Data?

Great questions, which can be answered with NumPyro’s toolkit as well:

1. Instead of having a Bernoulli trial in the model you can also use a Multinomial for your Bayesian inference.
2. We can also use categorical data for model fitting. That looks quite different. Instead of using the one-dimensional array we rely on as many dimensions as there are categories.

A good example to fit with categorical data instead of continuous one is in the excellent kaggle notebook on Age Imputation given other parameters for the Titanic dataset:

https://www.kaggle.com/fehiepsi/bayesian-imputation-for-age

## The Dos and Don’ts

Please stick to what you have!

Do not invent data. I cannot stress that enough. It is a very fine line that we are walking here.
If your experiments have generated insufficient data or you had to throw some data out, then imputing values, might not be the best way to go. Even when your data might become presentable that way. When talking to a client or an academic journal you don’t get points for: we simulated parameter X and Y to 85.3% – no matter how nice the model looks.
Please do declare what methods, distributions and, assumptions you used to make the models and imputations for maximum transparency.
One recommendation is that you can always take out the data, fit a model without missing measurements and compare it to a model with imputed values in the end. That can make for a compelling story.

## Conclusion

We have walked through data imputation, from the data-frame with holes in it to the final Bayesian inference and fitted posterior distribution.
We have shown how your Bayesian regression can benefit from an imputed vector and what the output looks like including confidence intervals.
Probabilistic Programming can be a powerful tool in your daily work.

Best of luck with your models and analysis!

## References

[1] A. Gelman, J.B. Carlin, et. al., Bayesian Data Analysis . Third Edition.
[2] R. McElreath. Statistical Rethinking. 2016. CRC Press.
[3] Uber Technologies. Regression using NumPyro. 2019. NumPyro Documentations.
[4] M. Betancourt. A Conceptual Introduction to Hamiltonian Monte Carlo. 2020 on ArXiv.
[5] CS 4110 – Programming Languages and Logics. Probabilistic Programming. 2016. Cornell University.

## NGBoost incremental Forecasting

or how to predict something you don’t know with confidence-intervals.

Currently there is a prominent forecasting challenge happening on Kaggle, the M5 Forecasting Challenge . Given are the sales of product-items over a long time range (1913 days or around 5 years), calendar information and price-info. An excellent in-depth analysis by M. Henze can be found here. The goal of the challenge is, to make informed predictions on how the sales for the individual items will continue over the upcoming 28 days.
One can approach this challenge with different methods: LSTMs, MCMC methods and others come to mind. This article will focus on how to generate predictions through a relatively novel approach – NGBoosting.

NGBoost was developed by the Stanford ML group and was published by Tony Duan, Anand Avati and others on arXiv in 2019. This approach optimizes boosting by utilizing natural gradients in contrast to other boosting methods, like LightGBM or XGBoost. It reduces overfitting on known data-points and reaches at least state-of the art performance on standard benchmarking datasets – see [1] for reference and insights. Additionally, after fitting a model, we can generate PDFs (probability density functions) over the data (see below) and one can even get uncertainty intervals.

Making predictions with Boosting algorithms is not as straight-forward as with LSTMs or MCMC methods. An LSTM can generate new data over specified time-steps after the model has learned the underlying structure of the input data. If you fit a boosting model and simply hit the “predict 28 days”-button, it would spit out a consistent value 28 times.
So lets get the party started in your favorite editor or notebook:

!pip install ngboost
from ngboost import NGBRegressor

df = # load the dataset as provided on kaggle ...
X, y = # split the df into X and y as described below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

ngb_clf = NGBRegressor(n_estimators=100, Score=LogScore, minibatch_frac=0.5)
ngb_clf.fit(X_train, y_train)

ngb_clf.predict(X_test)

Now that we have predicted something we want to look into the future – one look at a time. The approach to get adjusted predictions, is to take predicted data into account. That means we refit the model and make the next prediction incrementally. You can find a possible algorithm for this in Fig. 3. Lets say we have our input data in the format below (see Fig. 1). What we need in addition is an empty column for our next day (e.g. day 1914).

We then wrangle with our data (you might use python’s pandas, the R-tidyverse or maybe you like Excel) and come up with the beautiful dataframe below (see Fig. 2), that means we have: (1.) the sales as our y-value or what we want to fit and predict, (2.) one-hot-encoded the categories that we are interested in, e.g. time-features or price-features, etc. and (3.) label-encoded the item_id (this is optional, but helps to assign the predictions later). Now we follow the pseudo-code below:

• We split the dataframe in the days that we know (e.g. 1913+i where i is the timestep we are working with) lets call this X_train and the new data that we don’t know, called X_pred
• We train the regressor on the data that we have -> X_train.
• We call the predict function of the fitted regressor on X_pred
• We repeat the process until we have all 28 days

The user-guide to what ngboost offers and how that works can be found on the Stanford ML groups site.
The complete code to our solution and a notebook will be provided through a github-link, once the challenge is concluded in June 2020.

Happy Coding!

## References

1. Tony Duan, Anand Avati, et. al. NGBoost: Natural Gradient Boosting for Probabilistic Prediction. v3. 2020 available on arXiv
2. The NGBoost User Guide
3. The M5 Kaggle Challenge