## Compute the Incomputable | What SVI and ELBO do

One reason why Bayesian Modeling works with real world data. The approximate light-house in the sea of randomness.

When you want to gain more insights into your data you rely on programming frameworks that allow you to interact with probabilities. All you have in the beginning is a collection of data-points. It is just a glimpse into the underlying distribution from which your data comes. However, you do not just want to get simple data-points out in the end. What you want is elaborate, talkative density distributions with which you can perform tests. For this you use probabilistic frameworks like TensorFlow Probability, Pyro or STAN to compute posteriors of probabilities.
As we will see, the plain computation is not always feasible and we rely on Markov Chain Monte Carlo (MCMC) methods or Stochastic Variational Inference (SVI) to solve those problems. Especially over large data-sets or even every-day medium sized datasets we have to perform the magic of sampling and inference to compute values and fit models. If these methods would not exist we would be stuck with a neat model that we had thought up, but no way to know if it actually makes sense.
We have built such a model in a previous article on forecasting .
In this example we will look at inference and sampling from a Pyro point of view.

In short, instead of brute-forcing the computation Pyro makes an estimate. We can sample from that estimate and we can then compare that estimate to the actual data given our model. Then we make improvements – step by step. But why can’t we just compute the whole thing? I’m glad you asked.

## From infinity to constant

Lets say we have a dataset . We want to build a model ( ) that accurately describes our data. For that we need a set of variables. These variables correspond to events that are random and we call them latent random variables e.g. .
Now to see how well our model does we rely on posterior probability distribution . If this is close to the real deal the likelihood of this posterior, compared to the observed data will be low. That is a good starting point, now how do I compute this ? In order for this to work, we have to do inference like this: The math-savvy reader already sees that the right part of the equation above is in general not computable. We face the terror of the almighty integral over something we don’t know yet and given a continuous variable might be infinite. That is quite big.
What we can do instead is use a variational distribution; lets call it Q.
In an ideal world this approximate Q would be as close as possible to our P which we are trying to compute. So how do we get there?
One thing that we need first is a measure of how similar P and Q are. Enter the Kullback Leibler Divergence — you can find the legendary paper by Kullback and Leibler in the Annals of Mathematical Statistics .
Our tool to compare the distributions, the Kullback Leibler Divergence is
defined as : This measure tells us how similar the distributions P and Q are. If they are similar, the KL is low and if they are very different, the divergence is large.

Now this measure requires us to know both distributions P and Q. However we don’t know P yet. After all, that is what we are computing and so far our Q is a wild guess.
After some reformulation we end up with this (see  for a step-by-step solution): The first part: is what is called the evidence lower bound or ELBO. The second part: is a constant. That is much better. All we have to do now is maximize the ELBO to get minimal divergence between our P and Q.

Minimizing KL divergece corresponds to maximizing ELBO and vice versa.

## Learn Q with Deep-Learning.

Now that we have the basic theory laid out, what do we do about our Q?
We can learn this estimate by using some form of neural network. We already have TensorFlow or PyTorch in the case of Pyro at our disposal. Compared to the model we can build something that models our variational distribution. As soon as we have that we can use stochastic gradient descent to optimize and get our model. So lets fire away.

We use our data as input and get Q as an output from the network. If we have this as a Gaussian then the network gives us a mean and standard deviation .

If we picture this process in a very simple form, rolling out over time it looks something like this:

Over all our data we then observe various random events which we compare to what we have sampled at that point in time. We try to minimize the divergence – which means to maximize the evidence lower bound. At the last point in time the ELBO is maximized or the divergence is minimal. We then have fitted our model.

## Talk Pyro to me

Lets take a simple example case in which we use a fit a model-function , given the latent random variable which comes from a -distribution. The events we are observing are happening Bernoulli-random. This means that the event either occurs or it doesn’t. Lastly the events are i.i.d. – that is independent and identically distributed which we write as the pyro.plate notation. So that our model is:

 def model(self):
# sample z from the beta prior
f = pyro.sample("z", dist.Beta(10, 10))
# plate notion for conditionally independent events given f
with pyro.plate("data"):
# observe all datapoints using the bernoulli distribution
pyro.sample("obs", dist.Bernoulli(f), obs=self.data)

So far this is standard Pyro and in our forecasting model we did not specify a guide even though it was still there. You can take full control of your model process, by specifying the guide as such. In most cases this just looks like your model only slightly different. In this guide the latent z comes from this:

 def guide(self):
# register the two variational parameters with pyro
alpha_q = pyro.param("alpha_q", torch.tensor(15),
constraint=constraints.positive)
beta_q = pyro.param("beta_q", torch.tensor(15),
constraint=constraints.positive)
pyro.sample("latent_z", NonreparameterizedBeta(alpha_q, beta_q))

Lastly we run our inference by using the SVI object. We provide the previously specified model and the specified guide. Also we rely on the standard Adam optimizer for stochastic gradient descent.

def inference(self):
# clear everything that might still be around
pyro.clear_param_store()
# setup the optimizer and the inference algorithm
optimizer = optim.Adam({"lr": .0005, "betas": (0.93, 0.999)})
svi = SVI(self.model, self.guide, optimizer, loss=TraceGraph_ELBO())

You’re all set. Now lets look at some output. For this to take some shape I have used a simple Bayesian regression task and compute the posterior for the model-function (see my github for the specifics).
We utilize Pyro’s SVI but also a Markov Chain Monte Carlo procedure with a NUTS sampler.

## The Ugly Truth | Comparing SVI to MCMC

So far everything sounded very promising – almost too good to be true. With everything in life you should always check your results. This example below (Fig. 2-3) shows the fitted posterior distribution of the previously mentioned regression-task using SVI and MCMC side by side.

The in-depth understanding of how and why Markov methods and inference differ is topic of various research. As this is a deep-dive it will be the topic for another article. The only thing that we can conclude at this point is that the sampling routine has an evident impact on our final model.

## What do I do with SVI

We have looked a seemingly unfeasible problem in the face and taken a look at the theory of the underlying the inference process to figure out why this works. You can use deep-learning frameworks like TF and PyTorch to build your probabilistic models by approximating another model-distribution and use MCMC or SVI for this. The goal of it all is to minimize divergence (or maximize the lower bound) between our two approximated posteriors.
We should always look at our output and ask ourselves if what we see makes sense. The sampling procedure apparently influences our posterior outcome – so we should be careful.
All in all SVI allows us to take any data-set and perform Bayesian Modeling with it. If you ask me, that is a pretty nice feature.

## References

1. S. Kullback and R.Leibler. On Information and Sufficiency. 1951 in Annals of Mathematical Statistics.
2. D. Wingate and T. Weber. Automated Variational Inference in Probabilistic Programming. 2013 on arxiv.
3. Pyro Documentation – SVI tutorial I
4. Pyro Documentation – SVI tutorial II
5. Pyro Documentation – SVI tutorial III
6. Reza Babanezhad. Stochastic Variational Inference. UBC
7. D. Madras Tutorial Stochastic Variational Inference. University of Toronto. 2017

#### The Complete Pyro Example

The metioned example simply serves as a rough overview of how a Pyro class is set up. The functional code can be found in the Pyro documentation (see ).

class BetaExample:
def __init__(self, data):
# ... specify your model needs here ...
# the dataset needs to be a torch vector
self.data = data

def model(self):
# sample z from the beta prior
f = pyro.sample("z", dist.Beta(10, 10))
# plate notion for conditionally independent events given f
with pyro.plate("data"):
# observe all datapoints using the bernoulli likelihood
pyro.sample("obs", dist.Bernoulli(f), obs=self.data)

def guide(self):
# register the two variational parameters with pyro
alpha_q = pyro.param("alpha_q", torch.tensor(15),
constraint=constraints.positive)
beta_q = pyro.param("beta_q", torch.tensor(15),
constraint=constraints.positive)
pyro.sample("latent_z", NonreparameterizedBeta(alpha_q, beta_q))

def inference(self):
# clear the param store
pyro.clear_param_store()
# setup the optimizer and the inference algorithm
optimizer = optim.Adam({"lr": .0005, "betas": (0.93, 0.999)})
svi = SVI(self.model, self.guide, optimizer, loss=TraceGraph_ELBO())