Millennial Suicides – an Analysis about Change

A simple, yet meaningful Pyro model to illustrate change-points over time.

Walking alone. A photo by Mitchel Lensink on Unsplash.

One claim frequently made by the media is that the rate of suicides for younger people in the UK has risen from the 1980s to the 2000s. You might encounter it on the news, in publications, or simply as an accepted truth among the population. But how can you make this claim measurable?
To make the claim testable we look for data and find an overview of suicide rates, specifically for England and Wales, at the Office for National Statistics (UK), together with an overall visualization.

Making an assumption tangible

Generally, one essential type of question to ask – in everyday life, business or academia – is when a change has occurred. You are under the assumption that something has fundamentally changed over time. To prove that, you have to quantify it. So you get some data on the subject matter and build a model that shows you when changes have occurred as well as their magnitude. We are going to look at exactly how to do that.

Millennial Suicides Specifically

Now that we have our data, we take the average for the age-group of millennials. That entails ages in the range of 25 to 35, for the sake of simplicity. Given our dataset, we can visualize the average suicide count for millennials since the 1980s like this:

# select the millennial age group (ages 25 to 35, inclusive) and average per year
suicide_millenials_df = suicide_data[suicide_data.Age.isin(range(25, 36))].groupby(["Year"]).mean().reset_index()
# round values to the closest integer
suicide_millenials_df["Suicides"] = suicide_millenials_df.Suicides.round()
suicide_millenials_df.head()
Year    avg. Suicides per 100,000
1981    85.0
1982    83.0
1983    …
What the head of your dataframe should look like.
Fig. 1 – The average suicides (per 100.000 people) in the age group 25 to 35 for England and Wales from 1981 to 2017.

From this data we can clearly see a steady increase over time until the very early 2000s, followed by a decline until the 2010s. The inquisitive side of us wants to know when exactly that happened, so that we can go out and look for the root causes and the underlying issues associated with those trends. Let's build a model to find out exactly in which years to look for that substantial change. To look for a specific change-point we will use the Pyro PPL as a probabilistic framework.

The How

We will use the mathematical underpinnings described in [2]. The model entails: a uniform prior distribution over the years for the change-point T, and the rates \mu drawn from a half-normal distribution (the positive side of the Gaussian), selected via the Iverson-bracket (indicator) notation. Looking at the data we set a large scale for our rates, since we want to capture a change from roughly 75 to 130 – so a scale of around 50. Lastly, the observed suicide counts are modelled through a Poisson regression.
Note: while the formulation below uses a Poisson likelihood, in the Pyro model we will approximate it by sampling from a Normal distribution.

T \sim \mathcal{U}(1981,2017), \mu_0\sim\mathcal{N}^+(0, 50), \mu_1\sim\mathcal{N}^+(0, 50), n_t\sim\text{Poisson}(\mu_{[t>T]})

These parameters can be adjusted – maybe you are looking for changes on a different scale, either across time or in effect size. One might also consider a half-Cauchy distribution to allow for more extreme values of the rate.

The Pyro Model

import torch
import pyro
import pyro.distributions as dist

def model(years, suicides):
    # learnable observation noise, one value per data point
    σ = pyro.param('σ', torch.ones(data_len))
    # uniform prior over the years for the change-point
    T = pyro.sample('change', dist.Uniform(1981, 2017))
    # 0 before the change-point, 1 after – indexes the two rates
    grp = (years > T) * 1
    # independent events: one rate before and one after the change
    with pyro.plate('rate', 2):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50))
    y_obs = pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)

The grp gives us the index for our bracketed \mu.
The model observations are then sampled from the Normal distribution, taking into account \mu[grp] and a scale \sigma (initialized at 1; last line in the model).

Before we can run this model we convert our dataframe to Pyro readable tensors using PyTorch, like so:

years = torch.from_numpy(suicide_millenials_df.Year.values)
suicides = torch.from_numpy(suicide_millenials_df.Suicides.values)
data_len = len(suicide_millenials_df)

For the fitting procedure we perform Bayesian inference using MCMC. We draw 1000 samples and let the sampler warm up (burn in) for 300 iterations.

from pyro.infer import MCMC, NUTS

SAMPLES = 1000
WARMUP = 300
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_samples=SAMPLES, warmup_steps=WARMUP)
mcmc.run(years, suicides)

We specifically use the hands-off NUTS sampler to perform inference and find values for our Pyro parameters, and recommend that you do the same.
Beware that, in contrast to variational inference approaches, MCMC takes its time. After training we find a relatively high acceptance probability of around 91.2% and a step size \epsilon of around 0.00198.
Checking the model summary we see that there are no divergences – though our R-hat values are arguably too high (ideally they should be close to 1).
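With a recent Pyro version, the summary table below can be printed straight from the fitted sampler:

# print posterior diagnostics (mean, std, credible interval, n_eff, r_hat)
mcmc.summary()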

            mean       std    median      5.0%     95.0%     n_eff     r_hat
    change   1995.40      2.10   1995.40   1991.72   1998.24      3.04      1.94
      μ[0]     96.21     10.41     97.56     93.81    101.23     20.16      1.08
      μ[1]     87.62      2.59     87.90     83.01     90.94      3.03      1.95

Now with the samples at hand, we can display when a change-point has occurred most frequently over time:

Fig. 2 – Histogram of the occurrence of the change-point across years.
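The df used in the plotting snippet below is not constructed anywhere in the code above; a minimal sketch of how it could be assembled from the posterior samples – assuming the site names 'change' and 'μ' from our model, and that rate0/rate1 are simply the two entries of μ – is:

import pandas as pd

samples = mcmc.get_samples()
df = pd.DataFrame({
    "change": samples["change"].numpy(),
    "rate0": samples["μ"][:, 0].numpy(),   # rate before the change-point
    "rate1": samples["μ"][:, 1].numpy(),   # rate after the change-point
})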

We go ahead and plot our different rates over the given time:

import matplotlib.pyplot as plt

def pl(pt):
    # pt is one row of posterior samples: a change-point and the two rates
    grp = (years > pt['change']) * 1
    line = plt.plot(years, pt['rate0'] * (1 - grp) + pt['rate1'] * grp,
                    color='red', alpha=0.005)

df.apply(pl, axis=1)
Fig. 3 – Change-rate over time (red line) for 1000 samples. There is a change around 1995 to 1996, marking the downwards trend.
Fig. 4 – Another rate of change over time (red line). Between 1986 and 1989 the change manifests itself. This run took 2000 samples.

First we ran the previously described model with 1000 samples. This uncovered a change-point around the year 1995, indicating the upcoming downwards trend. The output can be seen in Fig. 3, where the rate is drawn as a red line. Here the model has quantified the change-point for the later, positive development.
We took another shot at running the model, this time with 2000 samples, and the model converged on a time range between 1986 and 1989 for yet another change-point. This marks the potential turning point for the upwards trend. If we want to uncover the cause behind the increase in suicides among millennials, we have to look for reasons in that time range, or for earlier changes whose effects have manifested themselves in that period.

Open Questions

We have shown that there is not just one change happening over time: there is a range of points for an increase in the rate and a range of points for a decrease. One might extend the model to pick up more than one change-point, for example by making \mu three-dimensional, or by fitting an entirely different model. A sketch of such an extension is shown below.
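A hedged sketch of what a two-change-point variant could look like in Pyro; the site names, the ordering of the change-points and the three-rate plate are illustrative assumptions, not the model we ran above:

def model_two_changes(years, suicides):
    σ = pyro.param('σ', torch.ones(data_len))
    # two ordered change-points: the second is drawn after the first
    T1 = pyro.sample('change_1', dist.Uniform(1981, 2017))
    T2 = pyro.sample('change_2', dist.Uniform(T1, 2017))
    # group index 0, 1 or 2 depending on where the year falls
    grp = (years > T1).long() + (years > T2).long()
    # three independent rates instead of two
    with pyro.plate('rate', 3):
        μ = pyro.sample('μ', dist.HalfNormal(scale=50))
    pyro.sample('obs', dist.Normal(μ[grp], σ), obs=suicides)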

Concluding the Model

We have shown how to get from data to a concrete formulation of a change over time and its underlying probabilities. This was translated directly into Pyro, and through sampling we could infer concrete change-points. Probabilistic programming is a powerful engine.

The complete notebook can be found here: https://github.com/RMichae1/PyroStudies/blob/master/changepoint_analysis.ipynb

The Gravity on the Subject

Though this was intended as an introductory piece on the topic of data science, the topic of self-harm and death is a grave one. In case you or a loved one are affected, there are help and resources out there. So I will leave you with this message here:

References

  1. Office for National Statistics, UK. Middle-aged generation most likely to die by suicide and drug poisoning. Dataset on suicide-rates. 2019.
  2. C. Scherrer. Bayesian Changepoint Detection with PyMC3. 2018.
  3. M. Hoffman, A. Gelman. The No-U-Turn Sampler. JMLR v15. 2014.

Improve Your Model with Missing Data | Imputation with NumPyro

Make the best of missing data the Bayesian way. Improve model performance and make comparative benchmarks using Monte Carlo methods.

A Missing frame ready to throw you off your model. Photo by Vilmos Heim on Unsplash.

The Ugly Data

How do you handle missing data, gaps in your data-frames or noisy parameters?
You have spent hours at work, in the lab or in the wild to generate or curate a dataset given an interesting research question or hypothesis. Terribly enough, you find that some of the measurements for a parameter are missing!
Another case that might throw you off is unexpected noise that was introduced at some point in the experiment and has doomed some of your measurements to be extreme outliers.
Sometimes you cannot just exile a whole parameter from your analysis, simply because it has 25% missing entries or maybe entails a “does not apply” field in your questionnaire.

Fear not, there is a way you can make the most of your dataset anyway.
In this post we will use Probabilistic Programming [5] to fill the gaps that your model might face.

Simulate Missing Data

Let's build a problem from scratch. The studious reader is most likely familiar with the Iris flower dataset. It is part of the "hello world" of the ML community, next to the well-known MNIST.
We use sklearn.datasets to get the data and put it into a pandas dataframe, like so:

import numpy as np
import pandas as pd
from sklearn import datasets

iris_data = datasets.load_iris()
iris_df = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']],
                       columns=iris_data['feature_names'] + ['target'])

To make this example more explicit, let's select only two of the given three species:

iris_df["target"] = iris_df.target.astype("int")
iris_df = iris_df[iris_df.target.isin([0, 1])]

Now you should see 100 measurements (rows, which are our n) of the Iris flower data.
We now take out 50% of the petal-length entries at random and assign NaN values, with a bit of NumPy and pandas magic.

# mark ~50% of the rows at random (True = keep, False = set to NaN)
random_vec = np.random.random(iris_df["petal length (cm)"].shape) < 0.5
iris_df["petal length (cm)"] = iris_df["petal length (cm)"].where(random_vec, other=np.nan)

Why do we care about all of our measurements?

When we plot our data, we can clearly see that the petal length and petal width can help us identify which species (target-column) we have (see Fig. 1).

Fig. 1: The relation of petal length and width toward the flower species aka. target-column. We can clearly identify two clusters.

We can make an even clearer distinction by regressing on the groups using just the vector which we have decimated earlier.

Fig. 2: The regression of the petal-length parameter on the identified species (target)

Too bad about half of the data is missing. So what do we do now?

Different Shades of Missing

First off, there are different degrees of missing data:

Missing Completely at Random (MCAR) and Missing at Random (MAR).

For a formal introduction to this see Bayesian Data Analysis [1] Ch. 18

One important distinction is that MCAR treats the missing values as independent of your observed measurements (let's call them y).
In our example we rely on our parameter being missing at random, or MAR. That means we don't think the chance of a value going missing depends on the value itself, BUT we do think there is potentially some relation between our length measurements and the target species.

A Regression Model to Rule them all

First lets extract some basic info from the data of what we have so far:

>>> print("Petal Length μ = {}".format(iris_df["petal length (cm)"].mean()))
>>> Petal Length μ = 2.8072727272727267
>>> print("Petal Length σ^2 = {}".format(iris_df["petal length (cm)"].std()))
>>> Petal Length σ^2 = 1.4923079993698234

Now we build the model to do the imputation. For this we use NumPyro, a light-weight probabilistic library that provides a NumPy backend for Pyro – a heavier-duty probabilistic language.
This comes in handy when formulating the model. We will use MCMC methods to fit the model and simulate values for the missing entries.
To make this more applicable to a real use case we combine the imputation with a Bayesian regression model. That means we want to model the distinction between the species given all parameters that we know. Our parameters therefore are:

sepal length := s_len, sepal width := s_width, petal width := p_width and, lastly petal length := p_len, which includes our missing data. 

For each of those parameters we introduce a learnable bias (coefficient), which comes from a regular Normal distribution, aka a Gaussian, for the sake of simplicity. So let's get started:

def model(s_len, s_width, p_len, p_width, target=None):
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0,1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0,1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0,1))
    ...

One can do a bit of tweaking of the bias parameters above, but for our initial runs the standard Gaussian worked reasonably well.
Now the delicate part. We introduce the distribution which we assume underlies our half-destroyed parameter. As with a lot of things, we assume it to be Gaussian, which in our case is a reasonable assumption. We therefore set loc=\mu (2.8 ± 0.2) and scale=\sigma (1.5 ± 0.2), using the values we computed earlier:

...    
    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
...

The informed reader will notice that \mu and \sigma are themselves probabilistic parameters which we model in the process. For both we assume Gaussians here, but more rigorous distributions could be chosen (for instance a strictly positive one for the scale).
Now we have to locate the entries which we need to impute. Hence we find the missing positions in the vector with NumPy.

...    
    # boolean mask and integer indices of the missing entries
    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]
    
    if target is not None:
        # training: the missing entries become a learnable parameter vector
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
    else:
        # prediction: sample the missing entries from the fitted distribution
        len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], 
                                                              len_sigma[len_nan_idx]))
...

Additionally, we tell NumPyro to make the imputation value (len_impute) a learnable parameter and fill it with as many zeros as we have missing values.
A short side note: our model has a predictive mode, which is used when no target (aka species) is provided. When we invoke the predictive capabilities, we instead sample from the introduced distribution with the sampled \mu and the sampled \sigma. Make sure to only apply this to the NaN positions in your vector! That is the meaning of the else clause:

numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], len_sigma[len_nan_idx])

We then put the imputed values into the right places and hand them to the inference sampling, together with the observed length values, for the optimization. For this we use the JAX library with some efficient index updating – linear-algebra style:

...    
   p_len = ops.index_update(p_len, len_nan_idx, len_impute)
  
   numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)
...

In our length sampling we include the actual length measurements as the observations in the process.
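A short side note on the JAX call: jax.ops.index_update has been removed from newer JAX releases. If you are on a recent version, the same update can be written with the .at indexing syntax (assuming p_len is a jax.numpy array):

# newer-JAX equivalent of ops.index_update(p_len, len_nan_idx, len_impute)
p_len = p_len.at[len_nan_idx].set(len_impute)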

Putting Things Together | Imputed Values into the Linear Model

For this model to be a proper linear model (a logistic regression), we have to fit our target y (the species).
Since we have two classes, we do this by invoking a Bernoulli distribution.
We therefore ask, in a Bernoulli process, whether we are looking at y=0 or y=1 – two classes, two potential outcomes.

The Bayesian Regression – Code

We introduce two more parameters: an intercept l and a bias term b_length for the length parameter. From these we compose the logits:

...    
    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len
    
    logits = logits + b_s_len*s_len + b_s_width*s_width + b_p_width*p_width
...

Now to the magical part. We use the logits for the Bernoulli distribution; they are composed of all the parameters that we put into the model, in interaction with their biases. The observed value in this process is now the target parameter, our y.

   numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

The complete NumPyro imputation model

import numpy as np
import numpyro
import numpyro.distributions as dist
from jax import ops                      # ops.index_update requires an older JAX version
from jax.scipy.special import expit

def model(s_len, s_width, p_len, p_width, target=None):
    # regression coefficients ("biases") for the fully observed parameters
    b_s_len = numpyro.sample("b_s_len", dist.Normal(0, 1))
    b_s_width = numpyro.sample("b_s_width", dist.Normal(0, 1))
    b_p_width = numpyro.sample("b_p_width", dist.Normal(0, 1))
    
    # impute length
    len_mu = numpyro.sample("p_length_mu", dist.Normal(2.8, 0.2))
    len_sigma = numpyro.sample("p_length_sigma", dist.Normal(1.5, 0.2))
    len_is_nan = np.isnan(p_len)
    len_nan_idx = np.array(np.isnan(p_len).astype(int)).nonzero()[0]
    
    if target is not None:
        # training: treat the missing entries as learnable values
        len_impute = numpyro.param("len_impute", np.zeros(len_is_nan.sum()))
    else:
        # prediction: sample the missing entries from the fitted distribution
        len_impute = numpyro.sample("len_impute", dist.Normal(len_mu[len_nan_idx], 
                                                              len_sigma[len_nan_idx]))
    p_len = ops.index_update(p_len, len_nan_idx, len_impute)
    
    numpyro.sample("p_length", dist.Normal(len_mu, len_sigma), obs=p_len)
    
    # Bayesian logistic regression on the species
    l = numpyro.sample("l", dist.Normal(0, 1))
    b_len = numpyro.sample("b_length", dist.Normal(0, 1))
    logits = l + b_len * p_len
    
    logits = logits + b_s_len * s_len + b_s_width * s_width + b_p_width * p_width
    if target is None:
        # prediction case
        probs = expit(logits)
        numpyro.sample("probs", dist.Delta(probs))

    numpyro.sample("target", dist.Bernoulli(logits=logits), obs=target)

Run It

To fit the model we use Bayesian inference with an MCMC method, the NUTS sampler to be specific. The beauty of the NUTS sampler is that it is a hands-off approach: it adjusts its step sizes by itself as it steps through the problem space. We let it have 5000 warm-up (burn-in) runs, which are discarded, and then run it for 10000 sampling iterations, just to be sure. Our inference finishes with an acceptance probability of 0.96, or 96% – neat! We then have a look at the summary of what has happened:

from jax import random
from numpyro.infer import MCMC, NUTS

# iris_data is assumed to be a dict of the model inputs (s_len, s_width, p_len, p_width, target)
mcmc = MCMC(NUTS(model=model), num_warmup=5000, num_samples=10000, num_chains=1)
mcmc.run(random.PRNGKey(42), **iris_data)

mcmc.print_summary()
Fig. 3: Model Summary after MCMC indicates convergence and solid R-hat values as metric.

The output clearly shows convergence of the parameters which we sampled in our inference. Specifically, the R-hat values are all close to 1.0.

We then pull the samples and visualize the posterior – we rely on JAX again for the formatting magic:

import jax.numpy as jnp

samples = mcmc.get_samples()
posterior_mu = jnp.expand_dims(samples['l'], -1) + \
               jnp.expand_dims(samples['b_length'], -1)*iris_df["petal length (cm)"].values + \
               jnp.expand_dims(samples['b_s_len'], -1)*iris_df["sepal length (cm)"].values + \
               jnp.expand_dims(samples['b_s_width'], -1)*iris_df["sepal width (cm)"].values + \
               jnp.expand_dims(samples['b_p_width'], -1)*iris_df["petal width (cm)"].values
Fig. 4: Final Bayesian regression over MCMC samples with 90% CI.

There we have it: a clear distinction between our two classes. To be fair, there is quite a bit of jitter going on above and below the bounds, but the regression line could not be nicer.
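As a follow-up, the 90% interval shown in Fig. 4 can be computed from the same posterior samples, for example with NumPyro's hpdi helper – a minimal sketch, assuming posterior_mu from the snippet above:

import jax.numpy as jnp
from numpyro.diagnostics import hpdi

mean_mu = jnp.mean(posterior_mu, axis=0)   # posterior mean of the logit per flower
hpdi_mu = hpdi(posterior_mu, 0.9)          # 90% highest posterior density interval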

We have shown that we can fill in the blanks in our model through MCMC-sampling.

For the code-savvy curious reader the final code can be found on my github and in Colab.

What if there are more than two classes? What about categorical data?

Great questions, which can be answered with NumPyro’s toolkit as well:

  1. Instead of having a Bernoulli trial in the model, you can use a Categorical (or Multinomial) likelihood for your Bayesian inference – see the sketch below.
  2. We can also use categorical data for model fitting. That looks quite different: instead of a one-dimensional array we rely on as many dimensions as there are categories.
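For the first point, a hedged sketch of what the multi-class likelihood line could look like – assuming logits is now built with one set of coefficients per class and therefore has shape (n, number_of_classes):

# drop-in replacement for the Bernoulli line in the model above
numpyro.sample("target", dist.Categorical(logits=logits), obs=target)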

A good example of fitting with categorical data instead of continuous data is the excellent Kaggle notebook on age imputation, given other parameters, for the Titanic dataset:

https://www.kaggle.com/fehiepsi/bayesian-imputation-for-age

The Dos and Don’ts

Please stick to what you have!

Do not invent data. I cannot stress that enough. It is a very fine line that we are walking here.
If your experiments have generated insufficient data, or you had to throw some data out, then imputing values might not be the best way to go – even if your data becomes more presentable that way. When talking to a client or an academic journal, you don't get points for "we simulated parameters X and Y to 85.3%", no matter how nice the model looks.
Please do declare what methods, distributions and assumptions you used to build the models and imputations, for maximum transparency.
One recommendation is that you can always fit one model on just the complete measurements and compare it to a model with imputed values in the end. That can make for a compelling story.

Conclusion

We have walked through data imputation, from the data-frame with holes in it to the final Bayesian inference and fitted posterior distribution.
We have shown how your Bayesian regression can benefit from an imputed vector and what the output looks like including confidence intervals.
Probabilistic Programming can be a powerful tool in your daily work.

Best of luck with your models and analysis!

References

[1] A. Gelman, J.B. Carlin, et al. Bayesian Data Analysis. Third Edition.
[2] R. McElreath. Statistical Rethinking. CRC Press, 2016.
[3] Uber Technologies. Regression using NumPyro. NumPyro documentation, 2019.
[4] M. Betancourt. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv, 2020.
[5] CS 4110 – Programming Languages and Logics. Probabilistic Programming. Cornell University, 2016.

How your model is optimized | Know your Optimization

The answer to: “Why is my model running forever?” or the classic: “I think it might have converged?”

The driver behind a lot of the models that the average data scientist or ML engineer uses daily relies on numerical optimization methods. Studying how the optimization behaves on different functions helps to gain a better understanding of how the process works.
The challenge we face on a daily basis is that someone gives us a model of how they think the world or their problem works. Now you, as a data scientist, have to find the optimal solution to the problem. For example, you look at an energy function and want to find the absolute, global minimum for your tool to work or your protein to be stable. Maybe you have modeled user data and want to find the ideal customer given all your input features – hopefully continuous ones.
Here is a comprehensive summary of how an optimizer works under the hood:

Benchmark Functions – the curved, the bent and the asymmetric

Before we meet our optimizers we have to get acquainted with the functions that we intend to optimize. My three personal favorites are:

  1. Mr. Nice-Guy is the ellipsoid function. It is overall nicely behaved, convex and has a single global minimum. It is defined as f_1(x)=\sum_{i=1}^d\alpha^{\frac{i-1}{d-1}}x_i^2.
  2. The Rosenbrock or banana function. It is not as nice as our f_1 – its valley of death can be a challenge for optimization algorithms. It is defined as f_2(x)=(1-x_1)^2+100(x_2-x_1^2)^2.
  3. The twisted one is our attractive-sector function. It is the most unusual of the selection, as it is non-convex and asymmetric around the minimum. It is defined as f_3(x)=\sum_{i=1}^d h(x_i)^2+100\,h(-x_i)^2, where h(x)=\frac{\log(1+\exp(qx))}{q}.

This is a manageable selection of functions, and there exists a multitude of other benchmark functions to run your optimizer across – from the depths of the Styblinski-Tang function to the beautiful peaks of Rastrigin. Some of them come ready-made and implemented, as in [1].

As you might have noticed the functions work on R^n and we provide an input vector \textbf{x}\in R^n. We will only be looking at two dimensions R^2 and three dimensions R^3 to make the display easier.
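To make the definitions above concrete, here is a small NumPy sketch of the three functions; the values alpha=1000 and q=1e4 are illustrative assumptions, not values prescribed above:

import numpy as np

def ellipsoid(x, alpha=1000.0):
    # f_1: convex and well behaved; alpha stretches the axes (assumes d >= 2)
    d = len(x)
    i = np.arange(d)
    return np.sum(alpha ** (i / (d - 1)) * x ** 2)

def rosenbrock(x):
    # f_2: the banana function in two dimensions
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def attractive_sector(x, q=1e4):
    # f_3: non-convex and asymmetric around the minimum
    h = lambda z: np.logaddexp(0, q * z) / q   # numerically stable log(1 + exp(qz)) / q
    return np.sum(h(x) ** 2 + 100 * h(-x) ** 2)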

Now that we have set up a playground, we can load up our tool of choice: scipy with its optimization library. Our goal is to capture how the optimizer finds the minimum on the benchmark function given different starting points. We just pick and choose optimizers, such as the conjugate-gradient method "CG" or the more intricate BFGS. It goes like this:

from scipy.optimize import minimize

# import benchmark function with its derivatives
from scipy.optimize import rosen
from scipy.optimize import rosen_der
from scipy.optimize import rosen_hess

import numpy as np

def banana(x, y):
    """
    A nice 2D way to look at Rosenbrock.
    """
    return (1 - x)**2 + (100 * (y - (x**2))**2)

# we select some interesting optimizers:
OPTIMIZERS = ['CG', 'BFGS', 'Newton-CG', 'dogleg']
FUNCTIONS = {'rosen': [rosen, rosen_der, rosen_hess, banana]
             # feel free to add or implement the ellipsoid or other benchmark functions here with their derivatives
             }

start_iter = 1
max_iter = 50
step = 1
x_min, x_max = -5, 5
y_min, y_max = -5, 5

# random starting points inside the bounds defined above
start_values = np.random.uniform(low=x_min, high=x_max, size=(10, 2))

def optimize_funct(fname, funct, x0, derivative, hess, optimizer, iteration):
    """
    Just a wrapper around the scipy minimize function.
    """
    fitted_min = minimize(funct, x0=x0, method=optimizer,
                          jac=derivative, hess=hess,
                          options={'maxiter': iteration})
    return fitted_min

# run actual optimization
fitted_optimizers = {}
for opt in OPTIMIZERS:
    fitted_optimizers[opt] = {}
    for name, functs in FUNCTIONS.items():
        fitted_optimizers[opt][name] = []
        f = functs[0]
        fd = functs[1]
        fdd = functs[2]
        f_2d = functs[3]
        for vals in start_values:
            computed_values = []
            x, y = vals
            z = f_2d(x, y)
            # include start values before optimization
            computed_values.append(np.array([x, y, z]))
            for i in range(start_iter, max_iter, step):
                out = optimize_funct(fname=name, funct=f, x0=vals, derivative=fd, hess=fdd,
                                     optimizer=opt, iteration=i)
                # only save the output values (stored under x)
                x, y = out.x
                z = f_2d(x, y)
                computed_values.append(np.array([x, y, z]))
            fitted_optimizers[opt][name].append(np.array(computed_values))

Not all optimizers are created equal. Some require the first and some the second derivatives of our objective functions, aka the Jacobian and the Hessian. For an extra-brief recap of what those are and how they work you can look here.
One thing to keep in mind is that we start at different starting positions and let the optimizer roam freely – we are not doing constrained optimization just yet.

Fig. 2 – Contour plot of the ellipsoid function from different starting points, optimized with a line-search method (CG). Each step is a "+". The optimizer finds the global minimum after a few steps.
Fig. 3 – Contour plot of the Rosenbrock (banana) function. The line search first walks in the right direction (see black "+"), but its hunt for the global minimum is to no avail. The valley it traverses is already low in function value, and the gradient offers little information on where to look next.
Fig. 4 – Contour plot of the attractive-sector function with BFGS optimization on the function space. Given three starting points (three colors), each step is depicted by a black cross. The minimum is quickly discovered, and we see how a line-search method (CG) behaves differently from BFGS, which belongs to the family of quasi-Newton methods.

Personally it helped me a lot to visualize how the optimization procedures move through the different function spaces. Here you can see how CG and BFGS perform. CG is scipy's nonlinear conjugate-gradient method – a plain, first-order, classical line-search procedure. BFGS on the other hand uses approximated higher-order information, which classifies it as a quasi-Newton method.

A quasi-Newton method approximates higher-order information, like the Hessian matrix (the second-order derivatives), and improves that approximation with each step taken. [2]
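For reference, the BFGS update of the Hessian approximation B_k – with step s_k = x_{k+1}-x_k and gradient difference y_k = \nabla f_{k+1}-\nabla f_k – reads:

B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}

In practice the method keeps track of the inverse of this approximation, so no linear system has to be solved at each step.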

Line-Searching | Backtrack Everything.

One method of optimization is to simply follow a line on the underlying function (see Fig. 2 and Fig. 3). We keep some information through backtracking, but that is about it. In order to optimize we may utilize first-derivative information of the function. An intuitive formulation of line-search optimization with backtracking is (a code sketch follows after the list):

  1. Compute gradient at your point
  2. Compute the step based on your gradient and step-size
  3. Take a step in the optimizing direction
  4. Adjust the stepsize by a previously defined factor e.g. \alpha
  5. repeat until minimum is found or the difference in value between your steps is very (very) small – e.g. 10^{-8}
Gradients give us information on how the slope of a function behaves. We take the gradient as the first-order derivative of the function, i.e. the vector of partial derivatives \nabla f(x)=[\frac{\partial f}{\partial x_1}\ldots\frac{\partial f}{\partial x_n}]^T.
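Here is a minimal sketch of the five steps above – steepest descent with a backtracking (Armijo) line search. The starting point, the constants c and rho, and the use of the Rosenbrock function are illustrative assumptions:

import numpy as np
from scipy.optimize import rosen, rosen_der

def gradient_descent_backtracking(f, grad, x0, alpha0=1.0, rho=0.5, c=1e-4,
                                  tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                       # 1. gradient at the current point
        p = -g                            # descent direction
        alpha = alpha0
        # 2. and 4. shrink the step until the sufficient-decrease condition holds
        while f(x + alpha * p) > f(x) + c * alpha * g @ p:
            alpha *= rho
        x_new = x + alpha * p             # 3. take the step
        # 5. stop when the improvement between steps is very (very) small
        if abs(f(x) - f(x_new)) < tol:
            return x_new
        x = x_new
    return x

# example: the Rosenbrock function from a fixed starting point
x_star = gradient_descent_backtracking(rosen, rosen_der, x0=[-1.5, 2.0])
print(x_star)   # steepest descent slowly crawls towards the minimum at (1, 1)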

Trust-Region Optimizers | Thinking in Circles

Instead of just looking straight ahead and straight back, we can also look around us in a 360-degree fashion. We use our function information, such as the Jacobian and the Hessian, to make an optimal step in the right direction. In order for this to work we consider the quadratic model function m_k(p)=f_k+g_k^Tp+\frac{1}{2}p^TB_kp, where g_k is the gradient of f at the current point and B_k the (approximate) Hessian.
When we do the optimization we optimize for p and adjust the radius of the trust region around us appropriately. This looks like:

Fig. 5 – Trust-region optimization on the contour of the attractive-sector function. Each step is depicted as a "+", and the circle around each step is the region from which the next step is chosen.

An intuitive formulation of the procedure can look like this
(see the Appendix for a formal representation):

  1. Compute Hessian and Gradient of your function
  2. Choose a point in a radius around your position that is optimal
  3. Take a step given the computed information
  4. Adjust the size of the radius given on how much or how little you have improved
  5. Repeat until convergence or tolerance is reached

Of course these short steps don't quite capture the complexity; you need more parameters and a bit more math to find the right point within your radius and to adjust the step and the radius. If that is interesting to you, see Alg. 1 and 2 in the Appendix.
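To make the loop above concrete, here is a minimal sketch of a trust-region iteration that uses the Cauchy point as a (much simplified) stand-in for the full subproblem solver in step 2; the constants follow common textbook choices and are assumptions, not the exact algorithm in the Appendix:

import numpy as np
from scipy.optimize import rosen, rosen_der, rosen_hess

def cauchy_point(g, B, delta):
    # minimizer of the quadratic model along the steepest-descent
    # direction, restricted to the trust-region radius delta
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(np.linalg.norm(g) ** 3 / (delta * gBg), 1.0)
    return -tau * delta / np.linalg.norm(g) * g

def trust_region(f, grad, hess, x0, delta0=1.0, delta_max=10.0, eta=0.15,
                 tol=1e-8, max_iter=1000):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)          # 1. gradient and Hessian at x
        if np.linalg.norm(g) < tol:
            break
        p = cauchy_point(g, B, delta)    # 2. (approximately) best point within the radius
        # predicted vs. actual improvement of the quadratic model m_k
        predicted = -(g @ p + 0.5 * p @ B @ p)
        actual = f(x) - f(x + p)
        rho = actual / predicted if predicted != 0 else 0.0
        # 4. adjust the radius depending on how well the model predicted the improvement
        if rho < 0.25:
            delta *= 0.25
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2 * delta, delta_max)
        if rho > eta:                    # 3. take the step only if it improved enough
            x = x + p
    return x

print(trust_region(rosen, rosen_der, rosen_hess, x0=[-1.5, 2.0]))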

We can observe an algorithm in practice and look at how the radius changes over time:

Fig. 6 – magnitude changes of function values (y axis) over steps taken (x axis). The radius \Delta changes with the improvement of the function-values.

We see that if the function values improve over time then our radius gets smaller (see Fig. 5 and 6) whereas if the values are bad initially we increase our radius. See the first steps of the optimization on the attr. sector function in Fig. 6.

Conclusion | The Rabbit Hole

We have seen how two families of optimizers work: the straight-line line-search optimizers and the trust-region optimizers. This is only the entrance of the rabbit hole. With each function and problem we look at, we can discover something new. We can analyze how our Jacobian or Hessian matrix has to behave to make our optimizer do its work, and we have to look at the shape and properties of our functions.
Obviously there exists a vast number of proofs and theorems that form the mathematical backbone of what we have just looked at.
If you want to dive deep and get your hands on the raw mathematics, I recommend going through the book Numerical Optimization (2nd ed.) by J. Nocedal and S. J. Wright [3].


Acknowledgements

Much of the work that I have shown here was done while attending the Numerical Optimization course at KU – DIKU. Without the guidance of the course officials this would not have been possible.

References

[1] MathWorks. Test functions for global optimization algorithms.
[2] C. Geiger and C. Kanzow. Quasi-Newton-Verfahren. Springer, 1999.
[3] J. Nocedal and S. J. Wright. Numerical Optimization (2nd ed.). Springer, 2006.

Appendix

Trust Region Algorithm

Alg. 1 – The trust-region optimization algorithm used for Fig. 5.
Alg. 2 – Solving for p; solving for \lambda is another subproblem.