A simple yet meaningful Pyro model to illustrate change-points over time.
One claim frequently made by the media is that the rate of suicides among younger people in the UK has risen from the 1980s to the 2000s. You find it on the news, in publications, or it is simply an accepted truth among the population. But how can you make this measurable?
To make this claim testable we look for data and find an overview of suicide rates, specifically for England and Wales, at the Office for National Statistics (UK), together with an overall visualization.
Making an assumption tangible
Generally, one type of essential question to ask – in everyday life, business or academia – is when changes have occurred. You are under the assumption that something has fundamentally changed over time. To prove that, you have to quantify it. So you get some data on the subject matter and build a model that shows you when changes occurred, as well as their magnitude. We are going to look at exactly how to do that.
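Before bringing in any probabilistic machinery, the core idea can be sketched with a brute-force least-squares search: try every split point and keep the one that minimizes within-segment variance. Everything in this sketch (the synthetic data, the helper name) is made up purely for illustration and is not part of the analysis below.

```python
import numpy as np

# Synthetic stand-in data: a series with a mean shift at index 10
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(80, 2, 10), rng.normal(110, 2, 10)])

def change_point(y):
    """Return the split index minimizing total within-segment variance."""
    costs = [np.var(y[:t]) * t + np.var(y[t:]) * (len(y) - t)
             for t in range(1, len(y))]
    return int(np.argmin(costs)) + 1

t = change_point(y)
print(t, y[:t].mean(), y[t:].mean())  # split index and the two segment means
```

The probabilistic model we build below does conceptually the same thing, but it returns a full posterior distribution over the split point instead of a single estimate.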
Millennial Suicides Specifically
Now that we have our data, we take the average for the age group of millennials. For the sake of simplicity, that entails ages in the range of 25 to 35. Given our dataset, we can visualize the average suicide count for millennials since the 1980s like this:
```python
# select ages 25 to 35 (inclusive) and average per year
suicide_millenials_df = suicide_data[suicide_data.Age.isin(range(25, 36))].groupby(["Year"]).mean().reset_index()
# round values to the closest integer
suicide_millenials_df["Suicides"] = suicide_millenials_df.Suicides.round()
suicide_millenials_df.head()
```
| Year | avg. suicides per 100,000 |
|------|---------------------------|
From this data we can clearly see a steady increase over time until the very early 2000s, followed by a decline until the 2010s. The inquisitive side of us wants to know when exactly that happened, so that we can go out and look for the root causes and underlying issues associated with those trends. Let's build a model to find out exactly in which years to look for that substantial change. To look for a specific change-point we will use the Pyro PPL as a probabilistic framework.
We will use the mathematical underpinnings described in C. Scherrer's Bayesian changepoint analysis (see references). This entails:

- a uniform prior distribution over the years for the change-point T,
- the rate μ from a half-normal distribution (the positive half of the Gaussian), selected per year in Iverson-bracket notation. Looking at the data we set a large scale for our rate, as we want to capture values between roughly 75 and 130, so something around 50,
- lastly, the observed rate of suicides through a Poisson regression.
Note: in place of a full Poisson regression we will, in practice, sample the observations from a normal distribution as an approximation (see the model code).
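Putting these pieces together, the generative story can be written compactly as follows (with $t$ indexing years, $[\,\cdot\,]$ the Iverson bracket and $\sigma$ the observation scale):

```latex
T \sim \mathrm{Uniform}(1981,\, 2017) \\
\mu_0,\ \mu_1 \sim \mathrm{HalfNormal}(50) \\
y_t \sim \mathcal{N}\big(\mu_{[t > T]},\ \sigma\big)
```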
These parameters can be adjusted – maybe you are looking for changes on a different scale, either across time or in effect size. One might consider a half-Cauchy distribution to allow for more extreme values, or an increase or decrease of the scale parameter.
The Pyro Model
```python
import torch
import pyro
import pyro.distributions as dist

def model(years, suicides):
    σ = pyro.param('σ', torch.ones(data_len))
    T = pyro.sample('change', dist.Uniform(1981, 2017))
    grp = (years > T) * 1
    # independent rates before and after the change-point
    with pyro.plate('rate', 2):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50))
    y_obs = pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)
```
The grp gives us the index for our Iverson-bracketed rate μ: 0 for years before the change-point T, 1 for years after it.
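To make the indexing concrete, here is a tiny stand-alone sketch of the Iverson-bracket trick; the years, change-point and rates are made-up values for illustration only:

```python
import torch

years = torch.tensor([1990., 1995., 2000., 2005.])
T = 1997.0                        # a hypothetical change-point
grp = (years > T) * 1             # Iverson bracket: 0 before T, 1 after
mu = torch.tensor([96.0, 87.0])   # rate before / rate after the change

print(mu[grp])                    # tensor([96., 96., 87., 87.])
```

Indexing the length-2 rate tensor with grp broadcasts each year to the rate of its segment, which is exactly what `μ[grp]` does inside the model.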
The model observations are then sampled from the Normal distribution, taking into account the rate μ[grp] and a scale σ initialized to 1 (last line in the model).
Before we can run this model we convert our dataframe to Pyro readable tensors using PyTorch, like so:
```python
years = torch.from_numpy(suicide_millenials_df.Year.values)
suicides = torch.from_numpy(suicide_millenials_df.Suicides.values)
data_len = len(suicide_millenials_df)
```
For the fitting procedure we perform Bayesian inference using MCMC methods. We sample for 1000 iterations and let the sampler warm up for 300 iterations.
```python
from pyro.infer import MCMC, NUTS

SAMPLES = 1000
WARMUP = 300
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=SAMPLES, warmup_steps=WARMUP)
mcmc.run(years, suicides)
```
We specifically use the hands-off NUTS sampler to perform inference and find values for our Pyro parameters and recommend that you do the same.
Beware that, in contrast to variational inference approaches, MCMC takes its time. After training we find a relatively high acceptance probability of around 91.2% and a step size of around 0.00198.
Checking the model summary we see no divergences, though the R-hat values are higher than we would like.
```
           mean     std   median     5.0%    95.0%   n_eff  r_hat
change  1995.40    2.10  1995.40  1991.72  1998.24    3.04   1.94
μ[0]      96.21   10.41    97.56    93.81   101.23   20.16   1.08
μ[1]      87.62    2.59    87.90    83.01    90.94    3.03   1.95
```
Now with the samples at hand, we can display when a change-point has occurred most frequently over time:
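The plotting snippet further down expects a dataframe df with columns change, rate0 and rate1. One way it could be assembled from mcmc.get_samples() is sketched here with made-up stand-in draws; the column names and the stand-in values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Stand-in for mcmc.get_samples(): a handful of fabricated posterior draws
posterior = {
    'change': np.array([1995.1, 1995.6, 1994.8, 1988.3]),
    'μ': np.array([[96.2, 87.5],
                   [97.0, 88.1],
                   [95.8, 87.0],
                   [96.5, 87.9]]),
}

df = pd.DataFrame({
    'change': posterior['change'],
    'rate0': posterior['μ'][:, 0],  # rate before the change-point
    'rate1': posterior['μ'][:, 1],  # rate after the change-point
})

# how often each (rounded) year appears as the sampled change-point
print(df['change'].round().value_counts())
```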
We go ahead and plot the two rates over the given time-range:
```python
import matplotlib.pyplot as plt

def pl(pt):
    grp = (years > pt['change']) * 1
    line = plt.plot(years, pt['rate0'] * (1 - grp) + pt['rate1'] * grp,
                    color='red', alpha=0.005)

df.apply(pl, axis=1)
```
First we ran the previously described model with 1000 samples. This uncovered a change-point around the year 1995, indicating the upcoming downward trend. The output can be seen in Fig. 3, where the rate is drawn as a red line. Here the model has quantified the change-point for the later, more positive development.
We took another shot at running the model, this time with 2000 samples, and it converged on a time-range of '86 to '89 for yet another change-point. This marks the potential turning point of the upward trend. If we want to uncover the causes behind the increase in suicides among millennials, we have to look for reasons in that time-range, or for earlier changes whose effects have manifested themselves in this period.
We have shown that there is not just one change happening over time. There is a range of points for an increase in the rate and a range of points for a decreasing rate. One might extend the model to pick up more than one change-point, for example by making the rate vector μ three-dimensional, or by fitting an entirely different model.
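As a rough illustration of what a multiple-change-point extension has to do (this is not the Pyro model itself), the brute-force least-squares idea generalizes to searching over pairs of split points; the synthetic data and helper name here are made up for illustration:

```python
import itertools
import numpy as np

# Synthetic series: rises, then falls – two mean shifts, at indices 8 and 16
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(80, 2, 8),
                    rng.normal(110, 2, 8),
                    rng.normal(90, 2, 8)])

def two_change_points(y):
    """Return the pair of split indices minimizing within-segment variance."""
    best, best_cost = None, np.inf
    for t1, t2 in itertools.combinations(range(1, len(y)), 2):
        segments = [y[:t1], y[t1:t2], y[t2:]]
        cost = sum(np.var(s) * len(s) for s in segments)
        if cost < best_cost:
            best, best_cost = (t1, t2), cost
    return best

print(two_change_points(y))
```

The Bayesian analogue would replace this exhaustive search with priors over two change-points and three rates, at the cost of a harder inference problem.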
Concluding the Model
We have shown how to get from data to a concrete formulation of change over time and its underlying probabilities. This was translated directly into Pyro, and through sampling we could infer concrete change-points. Probabilistic programming is a powerful engine.
The complete notebook can be found here: https://github.com/RMichae1/PyroStudies/blob/master/changepoint_analysis.ipynb
The Gravity of the Subject
Though this was intended as an introductory piece on a Data Science topic, the subject of self-harm and death is a grave one. In case you or a loved one are affected, there is help and there are resources out there. So I will leave you with this message:
- Office for National Statistics, UK. Middle-aged generation most likely to die by suicide and drug poisoning. Dataset on suicide-rates. 2019.
- C. Scherrer. Bayesian Changepoint Detection with PyMC3. 2018.
- M. Hoffman, A. Gelman. The No-U-Turn Sampler. JMLR v15. 2014.