Millennial Suicides – an Analysis of Change

A simple, yet meaningful Pyro model to illustrate change-points over time.

Walking alone. A photo by Mitchel Lensink on Unsplash.

One profound claim and observation made by the media is that the rate of suicides for younger people in the UK has risen from the 1980s to the 2000s. You might find it on the news, in publications, or it is simply accepted as truth by the population. But how can you make this measurable?
In order to make this claim testable, we look for data and find an overview of suicide rates, specifically for England and Wales, at the Office for National Statistics (UK) [1], together with an overall visualization.

Making an assumption tangible

Generally, one type of essential question to ask – in everyday life, business, or academia – is when changes have occurred. You are under the assumption that something has fundamentally changed over time. In order to prove that, you have to quantify it. So you get some data on the subject matter and build a model that shows you when changes have occurred, as well as their magnitude. We are going to look at exactly how to do that.

Millennial Suicides Specifically

Now that we have our data, we take the average for the age group of millennials. That entails ages in the range of 25 to 35, for the sake of simplicity. Given our dataset, we can visualize the average suicide count for millennials since the 1980s like this:

# select the millennial age group (ages 25 to 35, as described above) and average per year
suicide_millenials_df = suicide_data[suicide_data.Age.isin(range(25, 36))].groupby(["Year"]).mean().reset_index()
# round values to closest integer
suicide_millenials_df["Suicides"] = suicide_millenials_df.Suicides.round()
suicide_millenials_df.head()
Year    avrg. Suicides per 100.000
1981    85.0
1982    83.0
1983    …
What the head of your dataframe should look like.
Fig. 1 – The average suicides (per 100.000 people) in the age group 25 to 35 for England and Wales from 1981 to 2017.

From this data we can clearly see a steady increase over time until the very early 2000s, followed by a decline until the 2010s. The inquisitive side of us wants to know when exactly that happened, so that we can go out and look for the root causes and the underlying issues associated with those trends. Let's build a model to find out exactly in which years to look for that substantial change. To look for a specific change-point we will use the Pyro PPL as a probabilistic framework.

The How

We will use the mathematical underpinnings described in [2]. This entails:
a uniform prior distribution over the years T, and the rates \mu drawn from a half-normal distribution (the positive side of the Gaussian), selected through Iverson bracket notation. Looking at the data, we set a large scale for our rates, as we want to capture values ranging from about 75 to 130 – hence a scale of around 50. Lastly, we model the observed rate of suicides through a Poisson regression.
Note: in the implementation below, the Poisson regression is approximated by sampling the observations from a normal distribution.

T \sim \mathcal{U}(1981,2017), \mu_0\sim\mathcal{N}^+(0, 50), \mu_1\sim\mathcal{N}^+(0, 50), n_t\sim\text{Poisson}(\mu_{[t>T]})

These parameters can be adjusted – maybe you are looking for different scales of change, either across time or in effect size. One might consider a half-Cauchy distribution for more extreme values, or a larger or smaller scale.

The Pyro Model

import torch
import pyro
import pyro.distributions as dist

def model(years, suicides):
    # fixed observation noise, one scale value per data point
    σ = pyro.param('σ', torch.ones(data_len))
    # uniform prior over the candidate change-point years
    T = pyro.sample('change', dist.Uniform(1981, 2017))
    # 0 for years before the change-point, 1 for years after
    grp = (years > T) * 1
    # two independent rates: one before and one after the change-point
    with pyro.plate('rate', 2):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50.))
    # observations sampled around the rate of their respective segment
    y_obs = pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)

The grp variable gives us the index for our bracketed \mu.
The model observations are then sampled from the normal distribution, taking into account \mu and a scale \sigma = 1 (last line in the model).

Before we can run this model we convert our dataframe to Pyro readable tensors using PyTorch, like so:

years = torch.from_numpy(suicide_millenials_df.Year.values)
suicides = torch.from_numpy(suicide_millenials_df.Suicides.values)
data_len = len(suicide_millenials_df)

For the fitting procedure, we perform Bayesian inference using MCMC methods. We sample for 1000 iterations and let the sampler warm up for 300 iterations.

from pyro.infer import MCMC, NUTS

SAMPLES = 1000
WARMUP = 300
# No-U-Turn Sampler on our change-point model
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=SAMPLES, warmup_steps=WARMUP)
mcmc.run(years, suicides)

We specifically use the hands-off NUTS sampler [3] to perform inference and find values for our Pyro parameters, and we recommend that you do the same.
Beware that, in contrast to variational inference approaches, MCMC takes its time. After training we find a relatively high acceptance probability of around 91.2% and a step size \epsilon of around 0.00198.
Checking the model summary, we see no divergences, though our R-hat values are rather high.
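The summary below can be printed directly from the fitted MCMC object; the 5.0% and 95.0% columns correspond to prob=0.9:

# posterior means, credible intervals, effective sample sizes and R-hat diagnostics
mcmc.summary(prob=0.9)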

            mean       std    median      5.0%     95.0%     n_eff     r_hat
    change   1995.40      2.10   1995.40   1991.72   1998.24      3.04      1.94
      μ[0]     96.21     10.41     97.56     93.81    101.23     20.16      1.08
      μ[1]     87.62      2.59     87.90     83.01     90.94      3.03      1.95

Now with the samples at hand, we can display when a change-point has occurred most frequently over time:
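To work with the posterior draws, we can pull them out of the MCMC object into a pandas dataframe and plot the histogram of the sampled change-points. This is a minimal sketch; the dataframe and the column names rate0 and rate1 are our own construction, chosen to line up with the plotting snippet further below:

import pandas as pd

posterior = mcmc.get_samples()               # dict of tensors keyed by sample-site name
df = pd.DataFrame({
    'change': posterior['change'].numpy(),   # sampled change-point years
    'rate0': posterior['μ'][:, 0].numpy(),   # rate before the change-point
    'rate1': posterior['μ'][:, 1].numpy(),   # rate after the change-point
})
# histogram of the sampled change-points (Fig. 2)
df.change.hist(bins=30)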

Fig. 2 – Histogram of the occurrence of the change-point across years.

We go ahead and plot our different rates over the given time:

import matplotlib.pyplot as plt

def pl(pt):
    # assign each year to the rate before (rate0) or after (rate1) the sampled change-point
    grp = (years > pt['change']) * 1
    plt.plot(years, pt['rate0'] * (1 - grp) + pt['rate1'] * grp,
             color='red', alpha=0.005)

# df holds one row per posterior sample with columns 'change', 'rate0' and 'rate1'
df.apply(pl, axis=1)
Fig. 3 – Change-rate over time (red line) for 1000 samples. There is a change around 1995 to 1996, marking the downwards trend.
Fig. 4 – Another rate of change over time (red line). We see the change manifest itself over three events between 1986 and 1989. This run took 2000 samples.

First we ran the previously described model with 1000 samples. This uncovered a change-point around the year 1995, indicating the upcoming downwards trend. The output can be seen in Fig. 3, where the rate is shown as a red line. Here the model has quantified the change-point for the positive development.
We took another shot at running the model, this time with 2000 samples, and the model converged on a time range in the years of ’86 to ’89 for yet another change-point. This marks the potential turning point for an upwards trend. If we want to uncover the cause behind an increase in suicides among millennials, we have to look for reasons in that time range or in earlier changes whose effects have manifested themselves in this period.

Open Questions

We have shown that there is not just one change happening over time. There is a range of points for an increase in the rate and a range of points for a decreasing rate. One might extend the model to pick up more than one change-point – making \mu three-dimensional, as sketched below – or fit an entirely different model.
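As a rough, untested sketch of what such an extension could look like, here is a two-change-point variant of the earlier model; the site names change1/change2 and the ordering constraint on the second change-point are our own assumptions, not part of the analysis above:

def model_two_changes(years, suicides):
    σ = pyro.param('σ', torch.ones(data_len))
    # first change-point anywhere in the observed range
    T1 = pyro.sample('change1', dist.Uniform(1981, 2017))
    # second change-point constrained to lie after the first
    T2 = pyro.sample('change2', dist.Uniform(T1, 2017))
    # 0 before T1, 1 between T1 and T2, 2 after T2
    grp = (years > T1).long() + (years > T2).long()
    with pyro.plate('rate', 3):
        # three rates instead of two
        μ = pyro.sample('μ', dist.HalfNormal(scale=50.))
    pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)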

Concluding the Model

We have shown how to get from data to a concrete formulation of change over time and its underlying probabilities. This was translated directly into Pyro, and through sampling we could infer concrete change-points. Probabilistic programming is a powerful engine.

The complete notebook can be found here: https://github.com/RMichae1/PyroStudies/blob/master/changepoint_analysis.ipynb

The Gravity of the Subject

Though this was intended as an introductory piece on the topic of Data Science, the topic of self-harm and death is a grave one. In case you or a loved one are affected, there are help and resources out there. So I will leave you with this message here:

References

[1] Office for National Statistics, UK. Middle-aged generation most likely to die by suicide and drug poisoning. Dataset on suicide rates. 2019.
[2] C. Scherrer. Bayesian Changepoint Detection with PyMC3. 2018.
[3] M. Hoffman, A. Gelman. The No-U-Turn Sampler. JMLR v15. 2014.

The Bayesian Toolkit | 3 Probabilistic Frameworks

Account for uncertainties in your programs and pipelines or build probabilistic models.

The tools to build, train and tune your probabilistic models. Photo by Patryk Grądys on Unsplash.

We should always aim to create better Data Science workflows.
But in order to achieve that, we should first find out what is lacking.

Classical ML workflows are missing something

Classical Machine Learning pipelines work great. The usual workflow looks like this:

  1. Have a use-case or research question
  2. build and curate a dataset that relates to the use-case or research question
  3. build a model
  4. train and validate the model
    1. maybe even cross-validate, while grid-searching hyperparameters.
  5. test the fitted model
  6. deploy model for the use-case or answer the research question

As you might have noticed, one severe shortcoming is that it does not account for the uncertainty of the model and the confidence in its output.

Certain about being Uncertain

After going through this workflow, and given that the model results look sensible, we take the output for granted. So what is missing?
In this basic approach we have not accounted for missing or shifted data.
Some might interject and say that they have some augmentation routine for their data. That’s great – but did you formalize it?
What about building a prototype before having seen the data – something like a modeling sanity check? Simulate some data and build a prototype before you invest resources in gathering data and fitting insufficient models. This was already pointed out by Andrew Gelman in his keynote at PyData New York 2017.
Get better intuition and parameter insights! For deep-learning models you need to rely on a plethora of tools and plotting libraries to explain what your model has learned.
For probabilistic approaches you can get insights on parameters quickly.
So what tools do we want to use in a production environment?

I. STAN – The Statistician’s Choice

STAN is a well-established framework and tool for research. Strictly speaking, this framework has its own probabilistic language, and the Stan code reads more like a statistical formulation of the model you are fitting.
Once you have built and done inference with your model you save everything to file, which brings the great advantage that everything is reproducible.
STAN is well supported in R through RStan, Python with PyStan, and other interfaces.
In the background the framework compiles the model into efficient C++ code.
In the end, the computation is done through MCMC inference (e.g. the NUTS sampler), which is easily accessible, and even variational inference is supported.
If you want to get started with this Bayesian approach, we recommend the case studies.
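As a flavor of how a Stan model reads like a statistical formulation, here is a minimal sketch using the PyStan 2.x interface; the toy normal model and variable names are our own example and not tied to any analysis in this post:

import numpy as np
import pystan

# the model itself is written in the Stan language as a statistical formulation
model_code = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"""

y = np.random.normal(10., 2., size=50)          # toy data
sm = pystan.StanModel(model_code=model_code)    # compiles to C++ behind the scenes
fit = sm.sampling(data={'N': len(y), 'y': y}, iter=1000, chains=4)
print(fit)                                      # posterior summary for mu and sigma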

II. Pyro – The Programming Approach

My personal favorite tool for deep probabilistic models is Pyro. This language was developed and is maintained by the Uber Engineering division. The framework is backed by PyTorch so that the modeling that you are doing integrates seamlessly with the PyTorch models which you might already have.
Writing and training your models reads like any other Python code, with some special rules and formulations that come with the probabilistic approach.
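To give a flavor of that integration, here is a minimal sketch following the PyroModule pattern from the Pyro documentation: an ordinary torch.nn.Linear layer whose weight and bias are given priors. The class name and the specific priors are our own choices:

import torch
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

class BayesianRegression(PyroModule):
    def __init__(self, in_features, out_features):
        super().__init__()
        # a regular PyTorch linear layer, lifted into a PyroModule
        self.linear = PyroModule[torch.nn.Linear](in_features, out_features)
        # priors over the layer's weight and bias turn it into a Bayesian layer
        self.linear.weight = PyroSample(
            dist.Normal(0., 1.).expand([out_features, in_features]).to_event(2))
        self.linear.bias = PyroSample(
            dist.Normal(0., 10.).expand([out_features]).to_event(1))

    def forward(self, x, y=None):
        mean = self.linear(x).squeeze(-1)
        sigma = pyro.sample("sigma", dist.Uniform(0., 5.))
        with pyro.plate("data", x.shape[0]):
            pyro.sample("obs", dist.Normal(mean, sigma), obs=y)
        return mean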

As an overview, we have already compared STAN and Pyro modeling on a small problem set in a previous post: http://laplaceml.com/single-parameter-models/

Pyro excels when you want to find randomly distributed parameters, sample data, and perform efficient inference.
As this language is under constant development, not everything you are working on might be documented yet. There are a lot of use cases, existing model implementations, and examples, and the documentation gets better by the day.
The examples and tutorials are a good place to start, especially when you are new to the field of probabilistic programming and statistical modeling.

III. TensorFlow Probability – Google’s Favorite

When you talk Machine Learning, especially deep learning, many people think of TensorFlow. Since TensorFlow is backed by Google developers, you can be certain that it is well maintained and has excellent documentation.
If you already have TensorFlow, or better yet TF2, in your workflows, you are all set to use TF Probability as well.
Josh Dillon made an excellent case for why probabilistic modeling is worth the learning curve and why you should consider TensorFlow Probability in his talk at the TensorFlow Dev Summit 2019:

TensorFlow Probability: Learning with confidence (TF Dev Summit ’19) by TensorFlow Channel

And here is a short notebook to get you started on writing TensorFlow Probability models:

https://colab.research.google.com/github/tensorflow/probability/blob/master/tensorflow_probability/g3doc/_index.ipynb
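If you want a one-minute taste before opening the notebook, here is a minimal sketch of defining and querying a distribution in TFP (the variable names are our own):

import tensorflow_probability as tfp

tfd = tfp.distributions

# distributions are first-class objects you can sample from and score data against
normal = tfd.Normal(loc=0., scale=1.)
samples = normal.sample(5)             # five draws from the standard normal
log_probs = normal.log_prob(samples)   # log-density of those draws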

Honorable Mentions

PyMC3 is an openly available Python probabilistic modeling API. It is widely used in research, has great community support, and you can find a number of talks on probabilistic modeling on YouTube to get you started.

If you are programming in Julia, take a look at Gen. This is also openly available but in very early stages, so documentation is still lacking and things might break. Nevertheless, it appears to be an exciting framework, and if you are happy to experiment, the publications and talks so far have been very promising.

References

[1] Paul-Christian Bürkner. brms: An R Package for Bayesian Multilevel Models Using Stan
[2] B. Carpenter, A. Gelman, et al. STAN: A Probabilistic Programming Language
[3] E. Bingham, J. Chen, et al. Pyro: Deep Universal Probabilistic Programming

Your 10.000 hours in Data Science | A Walk down Data-Lane

The way from data novice to professional.

Clock with reverse numeral, Jewish Townhall Clock, Prague- by Richard Michael (all rights reserved).

There exists the idea that practicing something for over 10,000 hours (ten thousand hours) lets you acquire enough proficiency in a subject. The concept is based on the book Outliers by M. Gladwell. The mentioned 10k hours are how much time you spend practicing or studying a subject until you have a firm grasp of it and can be called proficient. Though this number of hours is somewhat arbitrary, we will take a look at how those hours can be spent to gain proficiency in the field of Data Science.
Imagine this as a learning budget in your Data-apprenticeship journey. If I were to start from scratch, this is how I would spend those 10 thousand hours to become a proficient Data Scientist.

The Core modules

Mathematics and Statistics: Basic (frequentist) statistics, Bayesian statistics, Intro to Data Analysis, some Probability Theory, some basic Analysis and Calculus.
A Data Scientist's main job is to provide insights into data from problem domains. Having a firm grasp of the underlying mathematics is essential. You can find a lot of excellent content in available university courses and Coursera courses. This is a good way to get started on your journey and gain a rudimentary understanding of how statistics works. Though fundamental courses are essential, please challenge yourself with the real deal.
Let's give that 2500h (312 days of 8h work).
Analysis/Stats Languages and modules: R, Python (pandas), Julia, SAS.
This is your bread and butter and can be considered part of your stats education. Here it is about learning by doing. Reading an O’Reilly book on R will only get you so far. Load up a curated dataset or some Kaggle challenge and work through some problems.
There is a spectrum on which statistical languages lie. If you're close to scientific computing, you might consider Julia. At the other end of the spectrum, focused purely on statistical analysis, is SAS. Some people argue that R can do both.
1000h (125 days of 8h work)

Multi-purpose programming languages: Python, Go, Java, Bash, Perl, C++,… .
This very much depends on the systems that you are facing on a daily basis. If you are just starting out, pick a language and stick to it. Once you learn the concepts, you will pick up other languages more easily.
I myself rely heavily on a combination of Python and Bash for my daily work. Other tasks require a thorough understanding of good old Java or even Perl to get started.
2000h (250 days of 8h work).

Database Technologies: {T-, My-, U-}SQL, PostgreSQL, MongoDB, … .
Relational or non-relational databases are some of the systems that you will have to work with in a production environment. It is useful to know how your data is stored, how queries run under the hood, and how to roll back a transaction. In later work your data sources might vary a lot, and it is good to have a basic understanding.
750h (94 days of 8h work)

Operating Systems: Windows, Linux, MacOS.
Whatever your work environment is: master it! Know the ins and outs. You might need to install some really weird library this one time to solve a specific problem (true story). Knowing where things are and why they are there goes a long way. Know what an SSH connection is and how to run an analysis on a different system. Running analyses on a machine other than your local one is going to happen at some point.
500h (62 days of 8h work).

ETL/ELT Systems: This is a mixture of programming languages, database technologies, and operating systems. Frameworks like Spark, Hadoop, and Hive offer advanced means for data storage and analysis on cluster-computing platforms. Some companies may rely on different tech stacks, like Microsoft Azure, Google Cloud, or AWS solutions.
This goes hand in hand with Database Technologies and might already require a firm understanding of higher-level programming languages (like Java or Python). There are also beginner-friendly systems, like KNIME, to get your feet wet.
400h (50 days of 8h)

Your Problem Cases: This may be Fin-Tech, Bio-Tech, or Business Analytics.
You should be familiar with the problem domain in which you are working. Is it research and development that you are doing? Are you visualizing business processes and customer behavior? Different fields require different insights. Your data has to make sense to you, and you should be able to see whether a model output is in a valid range or totally off. Spend an absolute minimum of 350h in your problem domain. We suggest more. A lot more.
You can decide if you want to be the method jack-of-all-trades or the expert in your field.
> 350h (44 days of 8h)

The attentive reader will see that this only adds up to 7500 hours so far. We have a basis now, and you might want to go in a particular direction from here.

Different Data-trades – Your personal direction

Back in the olden days, the dark unenlightened ages, we had guilds. The field of working with data also has different guilds: types of tradesmen that solve different problems, and there have been various articles on the subject.
Here is how those trades differ in their requirements:

  1. The Data Scientist: You do stats and analysis, you roll out solutions and deploy platforms that answer questions and provide insights. Neat. Up your statistics game: +200h. Maybe a bit of ML, which we call statistical learning: +100h, and some multi-purpose language like Python, Perl, or Bash for your day job.
  2. The Data Engineer: You care about the data. You build the systems that enable the Data Scientist to work effectively and give the analysts their daily bread. Your focus lies on the underlying systems. +500h on Systems, +200h on ETL/ELT systems, +200h on DBs and programming languages +200h. You can even afford to not be so heavily involved in the statistics part (-500h on pure stats and theory).
  3. The Machine Learning Engineer: You implement the models that make the magic happen.
    You need all the statistical insights and also proficiency with all the ML on top of things –
    +300h on ML, + 200h on Stats. You should also be well versed in a high level programming language which makes implementing ML-models possible in the first place.
  4. The Analyst: You take data from the systems and make it pretty and reportable for management. You need to know what matters in your problem domain: +200h in problem cases, +100h in DB systems. You need your SQL daily. There is a big variety of analyst jobs out there, though, so inform yourself and continue learning what you find necessary.
Fig. 1 – Time and experience of a Data Scientist compared to an Analyst.
Fig. 2 – Time and experience of an ML-Engineer compared to a Data-Engineer.
Math/Stats = Mathematics and Statistics, includes your work with stat-languages like R.
Systems = Setting up and maintaining systems (e.g. ETL/ELT)
DBs = Database Systems (relational, non-relational)
SWE = Software Engineering (Python, Perl, Java, Scala, Go, JavaScript, …)
Exp. Field = Experience in the problem domain. Hours spent on the topic.

The figures above illustrate exemplary careers and how much time and effort can be placed in each domain. The figures are suggestions and vary from individual to individual. For example, the Data Engineer in our example could also be called a Software Engineer who handles a lot of data, while another engineer might need more experience working with databases than with programming languages.

As with everything in life, the lines are blurred. Sometimes a Data Scientist also does the work of an ML Engineer and vice versa. Sometimes a Data Scientist also has to build analytics dashboards. Stranger things have happened in the past. Keep in mind that the 10k budget is your entry into the field and gives you a direction, not your specialization. So you have all the freedom in the world to eventually do that PhD in Parapsychology that you’ve always wanted.

How much work is ten thousand hours

10k hours is a long time. If you were to cram it into one year, that would be about 27h per day – not feasible for the average mortal.
Three years of work at around 9h per day is better suited.
If you're studying on the side and looking for a change in your job at some point in the future, then you might be set after a couple more years. The important part is to be consistent. A couple of quality hours a day over a few years will get you a long way.

Personal Note

My journey started in October 2016, coming from systems administration and starting a new undergraduate program. Now that it is July 2020, it has been around 1360 days of gaining experience in the field. That puts me in the range of >7500h (with over 5.5h every day), and I still continue to study and learn a lot every day. My undergraduate program in Cognitive Science helped me get a head start in the fields of computer science, applied mathematics, Bayesian statistics, DB systems, formal logic, Machine Learning courses, and much more. This, together with working part-time as a Data Engineer, helped me gain a lot of insights really quickly.

The Key element

I don’t care whether you do your 10 thousand hours at uni, in school, at seminars, or at home.
Nobody cares. You have to show your work. A degree, if done right, can show this – but it doesn’t have to. It is about gaining experience in the field by solving problems and gaining proficiency with concepts.
Solve a problem and put it in a portfolio, your resume, or simply your GitHub repository.
Whatever works best for you. People hire for your problem-solving skills, not for the A that you got on that one project two years ago. You might say that it was a really cool project altogether.
That’s great! If you made a dent in the universe, you should tell people about it anyway.

The journey is a long one and if you enjoy what you do it is worthwhile. You will learn a lot along the way.

A short disclaimer: always keep in mind that there are different fields of Data Science-related work out in the wild.
Every job has its specific purpose and requires a different tool-set.
Every employer might look for different sets of skills.