## Improve Your Model with Missing Data | Imputation with NumPyro

Make the best of missing data the Bayesian way. Improve model performance and make comparative benchmarks using Monte Carlo methods.

## The Ugly Data

How do you handle missing data, gaps in your data-frames or noisy parameters?
You have spent hours at work, in the lab or in the wild to generate or curate a dataset given an interesting research question or hypothesis. Terribly enough, you find that some of the measurements for a parameter are missing!
Another case that might throw you off is unexpected noise that was introduced at some point in the experiment and has doomed some of your measurements to be extreme outliers.
Sometimes you cannot just exile a whole parameter from your analysis, simply because it has 25% missing entries or maybe entails a “does not apply” field in your questionnaire.

Fear not, there is a way to make the most of your dataset anyway.
In this post we will use Probabilistic Programming [5] to fill the gaps that your model might face.

## Simulate Missing Data

Let's build a problem from scratch. The studious reader is most likely familiar with the Iris flower data-set. It is part of the “hello-world” of the ML community, right next to the well-known MNIST.
We use sklearn.datasets to get the data and put it into a pandas dataframe like so:

import numpy as np
import pandas as pd
from sklearn import datasets
iris_data = datasets.load_iris()
iris_df = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']],
                       columns=iris_data['feature_names'] + ['target'])

To make this example more explicit, let's select only 2 of the 3 given species:

iris_df["target"] = iris_df.target.astype("int")
iris_df = iris_df[iris_df.target.isin([0, 1])]

Now you should see 100 measurements (the rows, our n) of the Iris flower data.
We now take out roughly 50% of the petal length entries at random and assign NaN values with a bit of numpy and pandas magic.

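# Boolean vector that is True for roughly 50% of the rows; those entries are kept
random_vec = np.random.random(iris_df["petal length (cm)"].shape) < 0.5
# where() keeps values where the condition is True and writes NaN everywhere else
iris_df["petal length (cm)"] = iris_df["petal length (cm)"].where(random_vec, other=np.nan)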
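
To check that the masking step does what we expect, here is a minimal sketch on a stand-in array. It assumes a seeded `np.random.default_rng` generator (not part of the original code) so the mask is reproducible, and the `values` array is a hypothetical placeholder for the petal length column.

```python
import numpy as np

# Seeded generator so the same entries go missing on every run (an assumption of this sketch)
rng = np.random.default_rng(0)

# Hypothetical stand-in for the 100 petal length measurements
values = np.linspace(1.0, 5.0, 100)

# True marks the entries we keep; roughly half of them
keep = rng.random(values.shape) < 0.5

# Same effect as Series.where(keep, other=np.nan): keep where True, NaN elsewhere
masked = np.where(keep, values, np.nan)

frac_missing = np.isnan(masked).mean()
print(f"fraction missing: {frac_missing:.2f}")
```

Because the mask is drawn at random, the missing fraction is only approximately 50% for a sample of this size.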

## Why do we care about all of our measurements?

When we plot our data, we can clearly see that the petal length and petal width can help us identify which species (target-column) we have (see Fig. 1).

We can make an even clearer distinction by regressing on the groups using just the vector which we have decimated earlier.

Too bad about half of the data is missing. So what do we do now?

First off, there are different degrees of missing data:

Missing Completely at Random (MCAR) and Missing at Random (MAR).

For a formal introduction to this, see Bayesian Data Analysis [1], Ch. 18.

One important distinction is that MCAR treats the missingness of your parameter as independent of your observed measurements (let's call them y).
In our example we rely on our parameter being missing at random, or MAR. That means we do not think that the chance of a value going missing depends on the value of the parameter itself, but we do think that there is potentially some relation between our length measure and the target species.
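
The difference between the two mechanisms can be made tangible with a small simulation. This is a sketch under stated assumptions, not part of the original analysis: the group means (1.5 and 4.3), the noise level and the missingness probabilities (0.2 and 0.8) are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Fully observed covariate (think: species) and a measurement that depends on it
species = rng.integers(0, 2, size=n)
length = np.where(species == 0, 1.5, 4.3) + rng.normal(0.0, 0.3, size=n)

# MCAR: every entry goes missing with the same probability, independent of everything
mcar_missing = rng.random(n) < 0.5

# MAR: the chance of going missing depends on the observed species,
# but not on the length value itself once the species is known
p_missing = np.where(species == 1, 0.8, 0.2)
mar_missing = rng.random(n) < p_missing

true_mean = length.mean()
mcar_mean = length[~mcar_missing].mean()
mar_mean = length[~mar_missing].mean()
print(f"all data: {true_mean:.2f}, MCAR observed: {mcar_mean:.2f}, MAR observed: {mar_mean:.2f}")
```

Under MCAR the mean of the observed entries stays close to the mean of all entries. Under MAR the naive mean is pulled towards the group that is observed more often, which is why the model has to account for the relation to the observed variables.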

## A Regression Model to Rule them all

First let's extract some basic info from the data we have so far:

>>> print("Petal Length μ = {}".format(iris_df["petal length (cm)"].mean()))
Petal Length μ = 2.8072727272727267
>>> print("Petal Length σ = {}".format(iris_df["petal length (cm)"].std()))
Petal Length σ = 1.4923079993698234

Now we build the model to do the imputation. For this we use NumPyro, a lightweight probabilistic programming library that provides a NumPy backend for Pyro, a heavy-duty probabilistic programming language.
This comes in handy after formulating the model: we will use MCMC methods to fit the model and simulate values for the missing entries.
To make this more applicable to a use case, we combine it with a Bayesian regression model. That means we want to model the distinction in species given all parameters that we know. Our parameters therefore are:

sepal length := s_len, sepal width := s_width, petal width := p_width and, lastly, petal length := p_len, which includes our missing data.

For each of those parameters we introduce a learnable bias (in effect a regression coefficient) which comes from a standard Normal distribution, aka Gaussian, for the sake of simplicity. So let's get started:

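import numpyro
import numpyro.distributions as dist

def model(s_len, s_width, p_len, p_width, target=None):
    # one bias per fully observed feature, each with a standard Normal prior
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))
    ...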

One can do a bit of tweaking for the bias parameters above, but for our initial runs the standard Gaussian worked reasonably well.
Now the delicate part. We introduce the distribution which we assume underlies our half-destroyed parameter. As with a lot of things, we assume this to be Gaussian, which in our case is a reasonable assumption. Therefore we take loc=\mu (2.8 +/- 0.2) and scale=\sigma (1.5 +/- 0.2), which we computed earlier:

...
len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
...

The informed reader clearly sees that \mu and \sigma are themselves probabilistic parameters which we model in the process. For both we assume Gaussians, but more rigorous distributions could be chosen.
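
One reason a Gaussian prior on the scale is workable here, even though a scale must be positive, is that the chosen prior puts practically no mass below zero. The following sketch quantifies this with Python's standard library only; it is an illustration of the prior used above, not a recommendation from the original analysis.

```python
from statistics import NormalDist

# Prior used for the scale in this post: Normal(loc=1.5, scale=0.2)
sigma_prior = NormalDist(mu=1.5, sigma=0.2)

# Probability mass the prior puts on invalid (negative) scales
p_negative = sigma_prior.cdf(0.0)

# Central 95% range of the prior
low, high = sigma_prior.inv_cdf(0.025), sigma_prior.inv_cdf(0.975)
print(f"P(sigma < 0) = {p_negative:.1e}, 95% range = ({low:.2f}, {high:.2f})")
```

With a wider or less informed prior, a distribution with positive support, such as NumPyro's `dist.HalfNormal` or `dist.LogNormal`, would rule out negative scales by construction.
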
Now we have to find our targets which we model. Hence we localize our missing positions in the vector with numpy.

...
len_is_nan = np.isnan(p_len)
len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]

if target is not None:
    len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
else:
    # len_mu and len_sigma are scalars, so draw one value per missing entry
    len_impute = numpyro.sample("len_impute",
                                dist.Normal(len_mu, len_sigma).expand([len_nan_idx.shape[0]]))
...

Additionally, we tell NumPyro to make the imputation value (len_impute) a learnable parameter, initialized with as many zeros as we have missing values.
On a short side note: our model has a predictive mode, which is used when no target (aka species) is provided. When we invoke the predictive capabilities, we instead sample from the introduced distribution with the sampled \mu and sampled \sigma. Make sure to only apply this to the NaN positions in your vector! That is the meaning of the else clause:

numpyro.sample("len_impute", dist.Normal(len_mu, len_sigma).expand([len_nan_idx.shape[0]]))

We then put the imputed values into the right places and hand the result, together with the observed length values, to the inference sampling to do its optimization. For this we use the JAX library with some efficient index updating, linear-algebra style:

...
# jax.ops.index_update was removed from newer JAX releases; .at[].set() is its replacement
p_len = jnp.asarray(p_len).at[len_nan_idx].set(len_impute)

numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)
...

In our length sampling we include the actual length measurements as the observations.
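
The locate-and-fill step can be tried in isolation. Here is a minimal sketch in plain numpy on a hypothetical toy vector; the imputed values are placeholders, not values produced by the model.

```python
import numpy as np

# Hypothetical toy vector with two missing entries
p_len = np.array([1.4, np.nan, 4.7, np.nan, 1.5])

# Boolean mask and integer positions of the missing entries
len_is_nan = np.isnan(p_len)
len_nan_idx = len_is_nan.nonzero()[0]

# Placeholder imputed values, one per missing entry (hypothetical numbers)
len_impute = np.array([2.8, 2.8])

# Write the imputed values into a copy, leaving the original untouched
filled = p_len.copy()
filled[len_nan_idx] = len_impute
print(len_nan_idx, filled)
```

JAX arrays are immutable, which is why the model uses a functional index update that returns a new array instead of assigning in place as done here.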

## Putting Things Together | Imputed Values into the Linear Model

For this model to be a proper linear model, we have to fit our target y (the species).
Since we have two classes, we do this by invoking a Bernoulli distribution.
We therefore ask in a Bernoulli process whether we are looking at y=0 or y=1 – two classes, two potential outcomes.

### The Bayesian Regression – Code

We introduce two parameters: an intercept l and a bias term for the length parameter, b_length. We compose the logits from these and from the biases of the remaining parameters:

...
l = numpyro.sample("l", dist.Normal(0, 1))
b_len = numpyro.sample("b_length", dist.Normal(0, 1))
logits = l + b_len * p_len

logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
...

Now to the magical part. We use the logits for the Bernoulli, which are composed of all the parameters that we put into the model, each weighted by its bias. The observed value in this process is now the target parameter, our y.

   numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)
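
The Bernoulli is parameterized by logits, which the logistic function maps to the probability of y=1. A minimal standard-library sketch of that mapping follows; the `sigmoid` helper and the example logits are illustrative and not part of the model code.

```python
import math

def sigmoid(logit):
    """Numerically stable logistic function mapping a logit to a probability."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

# A logit of 0 means both classes are equally likely;
# large positive or negative logits push the probability towards 1 or 0
for logit in (-4.0, 0.0, 4.0):
    print(f"logit {logit:+.1f} -> P(y=1) = {sigmoid(logit):.3f}")
```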

### The complete NumPyro imputation model

import jax.numpy as jnp
import numpy as np
import numpyro
import numpyro.distributions as dist
from jax.scipy.special import expit

def model(s_len, s_width, p_len, p_width, target=None):
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))

    # impute length
    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]

    if target is not None:
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
    else:
        # len_mu and len_sigma are scalars, so draw one value per missing entry
        len_impute = numpyro.sample("len_impute",
                                    dist.Normal(len_mu, len_sigma).expand([len_nan_idx.shape[0]]))
    # .at[].set() replaces jax.ops.index_update, which newer JAX releases no longer provide
    p_len = jnp.asarray(p_len).at[len_nan_idx].set(len_impute)

    numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)

    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len

    logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
    if target is None:
        # prediction case
        probs = expit(logits)
        numpyro.sample("probs", dist.Delta(probs))

    numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)
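
The model takes its inputs as named arguments, so the data has to be handed over as a dict whose keys match those names. Below is a sketch of how such a dict could be assembled and checked. The `columns` values are hypothetical toy numbers, `model_stub` is a stand-in that only mirrors the model's signature, and `model_data` and `rename` are names made up for this example.

```python
import inspect
import numpy as np

def model_stub(s_len, s_width, p_len, p_width, target=None):
    """Stand-in with the same signature as the model in this post."""

# Hypothetical toy columns, keyed like the iris data-frame
columns = {
    "sepal length (cm)": np.array([5.1, 7.0, 4.9]),
    "sepal width (cm)": np.array([3.5, 3.2, 3.0]),
    "petal length (cm)": np.array([1.4, np.nan, 1.4]),
    "petal width (cm)": np.array([0.2, 1.4, 0.2]),
    "target": np.array([0, 1, 0]),
}

# Map data-frame column names to the model's argument names
rename = {
    "sepal length (cm)": "s_len",
    "sepal width (cm)": "s_width",
    "petal length (cm)": "p_len",
    "petal width (cm)": "p_width",
    "target": "target",
}
model_data = {arg: columns[col] for col, arg in rename.items()}

# bind() raises a TypeError if a key is missing or does not match an argument
inspect.signature(model_stub).bind(**model_data)
print(sorted(model_data))
```

With the real data-frame, the arrays would come from expressions such as `iris_df["petal length (cm)"].values` instead of the toy numbers.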

## Run It

To fit the model we use Bayesian inference with an MCMC method, the NUTS sampler to be specific. The beauty of the NUTS sampler is that it is a hands-off approach: it adjusts its step size by itself while stepping through the problem space. We give it 5000 warm-up (or burn-in) runs, which are discarded in the process, and let it run for 10000 iterations, just to be sure. Our inference finishes with an acceptance probability of 0.96 or 96%. Neat! We then have a look at the summary of what has happened:

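from jax import random
from numpyro.infer import MCMC, NUTS
# iris_data: dict keyed by the model's argument names (s_len, s_width, p_len, p_width, target)
mcmc = MCMC(NUTS(model=model), num_warmup=5000, num_samples=10000, num_chains=1)
mcmc.run(random.PRNGKey(42), **iris_data)

mcmc.print_summary()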

The output clearly shows convergence of the parameters which we sampled in our inference. Specifically, the R-hat values are all close to 1.0.
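
To build some intuition for what R-hat measures, here is a sketch of the classic Gelman-Rubin statistic in plain numpy, with a split variant that cuts each chain in half so it also works for a single chain. This is an illustrative re-implementation under stated assumptions, not NumPyro's own code, and the two synthetic chains are made up for the example.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic R-hat for an array of shape (num_chains, num_samples)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between_over_n = chains.mean(axis=1).var(ddof=1)  # B/n: variance of the chain means
    var_plus = (n - 1) / n * within + between_over_n
    return np.sqrt(var_plus / within)

def split_gelman_rubin(chains):
    """Split every chain in half and treat the halves as separate chains."""
    chains = np.asarray(chains, dtype=float)
    half = chains.shape[1] // 2
    halves = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    return gelman_rubin(halves)

rng = np.random.default_rng(0)

# A well-mixed single chain: all draws come from the same distribution
mixed = rng.normal(0.0, 1.0, size=(1, 2000))

# A chain that drifts: the second half sits around a different mean
stuck = np.concatenate([rng.normal(0.0, 1.0, 1000), rng.normal(3.0, 1.0, 1000)])[None, :]

rhat_mixed = split_gelman_rubin(mixed)
rhat_stuck = split_gelman_rubin(stuck)
print(f"well-mixed: {rhat_mixed:.3f}, drifting: {rhat_stuck:.3f}")
```

Values close to 1.0 indicate that the halves agree with each other, while values clearly above 1.0 signal that the chain has not settled.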

We then pull the samples and visualize the posterior. We rely on JAX again for the formatting magic:

samples = mcmc.get_samples()
posterior_mu = (
    jnp.expand_dims(samples['l'], -1)
    + jnp.expand_dims(samples['b_length'], -1) * iris_df["petal length (cm)"].values
    + jnp.expand_dims(samples['b_s_len'], -1) * iris_df["sepal length (cm)"].values
    + jnp.expand_dims(samples['b_s_width'], -1) * iris_df["sepal width (cm)"].values
    + jnp.expand_dims(samples['b_p_width'], -1) * iris_df["petal width (cm)"].values
)

There we have it. A clear distinction between our two classes. To be fair there is quite a bit of jitter going on above and below the bounds, but the regression-line could not be nicer.
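
One way to turn a stack of posterior draws into a regression line with an uncertainty band is to take the mean and a percentile interval across the sample axis. The sketch below does this on synthetic draws; the intercept and slope distributions are hypothetical and are not the fitted values from this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor grid and posterior draws (one row per draw)
x = np.linspace(1.0, 5.0, 50)
intercept = rng.normal(-3.0, 0.3, size=(2000, 1))
slope = rng.normal(1.2, 0.1, size=(2000, 1))

# Broadcasting gives one regression line per posterior draw: shape (2000, 50)
draws = intercept + slope * x

# Posterior mean line and a 90% interval at every x position
mean_line = draws.mean(axis=0)
lower, upper = np.percentile(draws, [5.0, 95.0], axis=0)
print(draws.shape, mean_line.shape, lower.shape, upper.shape)
```

The same reduction over axis 0 applies to an array like `posterior_mu`, which also has one row per posterior sample.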

We have shown that we can fill in the blanks in our model through MCMC-sampling.

For the code-savvy, curious reader, the final code can be found on my GitHub and in Colab.

## What If There Are More Than Two Classes? What About Categorical Data?

Great questions, which can be answered with NumPyro’s toolkit as well:

1. Instead of having a Bernoulli trial in the model, you can use a Categorical (a Multinomial with a single trial) for your Bayesian inference.
2. We can also use categorical data for model fitting. That looks quite different: instead of using a one-dimensional array, we rely on as many dimensions as there are categories.
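
With more than two classes, the single logit becomes one logit per class, and a softmax takes the place of the logistic function. Here is a minimal numpy sketch of that step; the logits are hypothetical numbers for two flowers and three species.

```python
import numpy as np

def softmax(logits):
    """Turn rows of logits into rows of class probabilities."""
    # Subtracting the row maximum avoids overflow and does not change the result
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits: 2 flowers (rows) x 3 species (columns)
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)
print(probs.round(3))
```

In a NumPyro model, such per-class logits could be passed to `dist.Categorical(logits=...)` in place of the Bernoulli; that is an assumption about how one might extend the model, not something taken from the code in this post.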

A good example of fitting with categorical data instead of continuous data is the excellent Kaggle notebook on age imputation given other parameters for the Titanic dataset:

https://www.kaggle.com/fehiepsi/bayesian-imputation-for-age

## The Dos and Don’ts

Please stick to what you have!

Do not invent data. I cannot stress that enough. It is a very fine line that we are walking here.
If your experiments have generated insufficient data or you had to throw some data out, then imputing values might not be the best way to go, even if your data becomes presentable that way. When talking to a client or an academic journal, you don’t get points for “we simulated parameters X and Y to 85.3%”, no matter how nice the model looks.
Please do declare what methods, distributions, and assumptions you used to make the models and imputations, for maximum transparency.
One recommendation is that you can always take out the data, fit a model without missing measurements and compare it to a model with imputed values in the end. That can make for a compelling story.

## Conclusion

We have walked through data imputation, from the data-frame with holes in it to the final Bayesian inference and fitted posterior distribution.
We have shown how your Bayesian regression can benefit from an imputed vector and what the output looks like, including credible intervals.
Probabilistic Programming can be a powerful tool in your daily work.

Best of luck with your models and analysis!

## References

[1] A. Gelman, J.B. Carlin, et al. Bayesian Data Analysis. Third Edition.
[2] R. McElreath. Statistical Rethinking. 2016. CRC Press.
[3] Uber Technologies. Regression using NumPyro. 2019. NumPyro Documentations.
[4] M. Betancourt. A Conceptual Introduction to Hamiltonian Monte Carlo. 2020 on arXiv.
[5] CS 4110 – Programming Languages and Logics. Probabilistic Programming. 2016. Cornell University.