Optimizing Hyperparameters the Right Way

Efficiently exploring the parameter space through Bayesian Optimization with skopt in Python. TL;DR: my hyperparameters are always better than yours.

Explore vast canyons of the problem space efficiently – Photo by Fineas Anton on Unsplash

In this post we will build a machine learning pipeline using multiple optimizers and use the power of Bayesian Optimization to arrive at an optimal configuration for all our parameters. All we need are the sklearn Pipeline and skopt.
You can use your favorite ML models, as long as they have a sklearn wrapper (looking at you, XGBoost and NGBoost).

About Hyperparameters

Finding the best model to solve a problem is not just about the model itself. We also need the optimal parameters that make the model perform well on the given dataset. This is called hyperparameter search, or tuning.

For example, say we want to use a Random Forest in practice; its documentation states:

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, ...

Each of those parameters can be explored: every possible number of estimators ( n_estimators ) in the forest, say from 1 to 10,000; whether to split using “gini” or “entropy”; the maximum depth of the trees; and many, many more options. Each of these parameters can influence your model‘s performance, and worst of all, most of the time you do not know the right configuration when you’re starting out with a new problem set.

The Jack-Hammer aka Grid-Search

The brute-force way to find the optimal configuration is to perform a grid search, for example using sklearn’s GridSearchCV. This means trying out every possible combination of parameter values on your model.
On the bright side, you might find the desired values. The problem is that the runtime is horrible, and grid search does not scale: every new parameter value you add multiplies the number of combinations you have to test.
This is also quite uninformed: we have the intuition that some regions of the parameter space are more promising than others, so we should not need to test values that we already know are far off.
One step in the right direction is randomized search, e.g. RandomizedSearchCV, where parameter settings are sampled at random from specified ranges or distributions instead of being enumerated exhaustively.
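To make the contrast concrete, here is a minimal sketch on a toy regression problem (the model, ranges, and iteration counts are purely illustrative, not tuned recommendations):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint

X, y = make_regression(n_samples=200, n_features=10, random_state=21)

# grid search: every combination is evaluated
param_grid = {"n_estimators": [10, 50, 100, 500], "max_depth": [2, 5, 10]}
grid = GridSearchCV(RandomForestRegressor(random_state=21), param_grid, cv=3)
grid.fit(X, y)   # 4 x 3 = 12 combinations, each cross-validated

# randomized search: only n_iter randomly sampled settings are evaluated
param_dist = {"n_estimators": randint(10, 500), "max_depth": randint(2, 10)}
rand = RandomizedSearchCV(RandomForestRegressor(random_state=21), param_dist,
                          n_iter=12, cv=3, random_state=21)
rand.fit(X, y)   # 12 samples drawn from much larger ranges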

Better Bayesian Search

Our tool of choice is BayesSearchCV. This approach uses stepwise Bayesian Optimization to explore the most promising regions of the hyperparameter space.
Very briefly, Bayesian Optimization finds the minimum of an objective function over large problem spaces and works particularly well with continuous parameters. Under the hood it models the objective function with Gaussian Process regression. A thorough mathematical introduction can be found in [2].
The benefit of the Bayesian approach is that we can specify a much larger range of possible values, since over time the search automatically concentrates on the most promising regions and discards the less promising ones.
Plain grid search would need ages to blindly enumerate all of those values. Since we move much more effectively, we can allow for a much larger playing field.
Let’s look at an example.

The Problem Set

Today we use the diabetes dataset from sklearn for ease of use.
This saves us the trouble of loading and cleaning our data and the features are already neatly encoded.
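One way to load it is the following sketch (the variable name diabetes_df matches the train/test split further below; as_frame requires a reasonably recent scikit-learn):

from sklearn.datasets import load_diabetes

# load the features together with the "target" column as one DataFrame
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.frame
print(diabetes_df.head())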

Fig. 1 – Overview of the dataset.

We have columns for age, sex, bmi, blood pressure, and six serum values (s1 to s6), all numerically encoded. Our target value is a measure of disease progression.
We can see some interactions between the s-columns, indicating correlation between the blood serum values.
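A plot like the pairplot in Fig. 2 can be produced along these lines (a sketch assuming seaborn is installed; the exact column selection for the original figure may differ):

import seaborn as sns
import matplotlib.pyplot as plt

# pairwise scatterplots over the serum columns plus the target
sns.pairplot(diabetes_df[["s1", "s2", "s3", "s4", "s5", "s6", "target"]])
plt.show()

# numerical view of the same interactions
print(diabetes_df.corr().round(2))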

Fig. 2 – Excerpt from a pairplot examining parameter interactions over all data.

To build our pipeline, we first split our dataset into training and test sets with an 80:20 split.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(diabetes_df.drop(columns="target"), diabetes_df.target, test_size=0.2, random_state=21)

Build a Pipeline

We need three elements to build a pipeline: (1) the models to be optimized, (2) the sklearn Pipeline object, and (3) the skopt optimization procedure.

First, we choose two boosting models, the AdaBoost and GradientBoosting regressors, and for each we define a search space over its crucial hyperparameters. Any other regressor from the depths of the sklearn library would do, but boosting might win you the next hackathon (…it’s 2015, isn’t it?).
The search space is a dictionary with key-value pairs of the form { ‘model__parameter‘ : skopt.space.Object }. For each parameter we pick a space object from the skopt library covering the range we want to search. Categorical values are also supported by passing them as a list of strings (see Categorical below):

from skopt.space import Real, Integer, Categorical
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor

ada_search = {
    'model': [AdaBoostRegressor()],
    'model__learning_rate': Real(0.005, 0.9, prior="log-uniform"),
    'model__n_estimators': Integer(1, 1000),
    'model__loss': Categorical(['linear', 'square', 'exponential'])
}

gb_search = {
    'model': [GradientBoostingRegressor()],
    'model__learning_rate': Real(0.005, 0.9, prior="log-uniform"),
    'model__n_estimators': Integer(1, 1000),
    'model__loss': Categorical(['ls', 'lad', 'quantile'])
}

Second, we wrap the model in a sklearn Pipeline. The 'model' step acts as a placeholder that the search will swap out, and this is where both search spaces (AdaBoost and GradientBoosting) come together for model selection:

from sklearn.pipeline import Pipeline

pipe = Pipeline([
                 ('model', GradientBoostingRegressor())
])

Third, we optimize over our search space by invoking BayesSearchCV. We also specify how often the optimizer should sample each search space; in our case that is 100 invocations each. Then we fit the pipeline with a plain .fit() call:

from skopt import BayesSearchCV

opt = BayesSearchCV(
    pipe,
    [(ada_search, 100), (gb_search, 100)],
    cv=5
)

opt.fit(X_train, y_train)

After the fitting is done, we can ask for the best parameters found and use the score function on the unseen test data.
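A short sketch of that evaluation, using the standard sklearn search-CV attributes that BayesSearchCV also exposes:

print("best parameters:", opt.best_params_)
print("best CV score:", opt.best_score_)            # mean cross-validated score
print("test score:", opt.score(X_test, y_test))     # score on the unseen test data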

Fig. 3 – Evaluating the fitted pipeline

We can see both the validation and the test score. The model with the best results is the AdaBoost model with a linear loss and 259 estimators at a learning rate of 0.064. Neat.

From the output we can also see that the optimizer uses a Gaussian Process for the optimization under the hood.

Fig. 4 – Insights into the optimizer results with a peek into the model.

For illustrative purposes we can also call GP minimization specifically – we’ll use the AdaBoost regressor for that – only focusing on a numerical hyperparameter space.

ada = AdaBoostRegressor(random_state=21)

# numerical space
space  = [Real(0.005, 0.9, "log-uniform", name='learning_rate'),
          Integer(1, 1000, name="n_estimators")]

The major change here is that we now have to define an objective function to optimize. To map the points of our search space onto named model parameters, we use skopt's use_named_args decorator.

from skopt.utils import use_named_args
from sklearn.model_selection import cross_val_score
import numpy as np

@use_named_args(space)
def objective(**params):
    ada.set_params(**params)
    return -np.mean(cross_val_score(ada, X_train, y_train, cv=5, n_jobs=-1,
                                    scoring="neg_mean_absolute_error"))

Our objective is therefore the mean absolute error of the cross-validated fit, obtained by negating sklearn's neg_mean_absolute_error score so that we have something to minimize. Other scoring functions might suit your problem better.
We then use GP minimization to find the optimal parameters for our regressor.

from skopt import gp_minimize

res = gp_minimize(objective, space, n_calls=100, random_state=21)

Visualize the problem space – post-optimization

Using GP optimization directly allows us to plot the convergence of the minimization process.
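With the result of gp_minimize stored in a variable (res above), a convergence plot like Fig. 5 is a single call to skopt's plotting helpers:

from skopt.plots import plot_convergence
import matplotlib.pyplot as plt

plot_convergence(res)   # cumulative minimum of the objective per iteration
plt.show()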

Fig. 5 – Convergence of GP minimization while finding the optimal hyperparameters of the AdaBoost regressor with respect to the target column in the dataset.

We can see that the minimum function value has already converged after around 40 iterations.
The last excellent feature is visualizing the explored problem space. Since we only used numerical inputs at this point, we can evaluate the problem space by plotting it with skopt, like so:
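A sketch of that call, again using the stored result res:

from skopt.plots import plot_evaluations
import matplotlib.pyplot as plt

# histograms of the sampled values per dimension plus a scatterplot of the
# evaluated points; the best point found is highlighted
plot_evaluations(res)
plt.show()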

Fig. 6 – Visualizing the invoked learning_rate and n_estimator values over the problem space as histograms. The minimum is marked by the red star in the scatterplot.

From the figure above we can see that a lot of exploration happened at the lower end of our learning-rate spectrum, and that the most frequently tried estimator counts were around 200 and above 800.
The scatterplot paints a picture of how difficult the problem space is that we are trying to optimize, with the minimum marked by a red star. Although we have reached convergence, not many evaluated points lie in the region around the minimum, so increasing the number of evaluations could be considered.

Now that we have the best model given our parameter space, we can implement that model accordingly and move on to its specific analysis.
As you have seen, skopt offers a lot in the realm of optimization, and I encourage you to use it in your daily ML engineering.

Happy Exploring!

The complete code can be found in a Notebook here.

References

  1. Marc Claesen and Bart De Moor. Hyperparameter Search in Machine Learning. 2015, on arXiv.
  2. Peter I. Frazier. A Tutorial on Bayesian Optimization. 2018 on arXiv.
  3. scikit-optimize contributors (BSD License). Scikit-learn hyperparameter search wrapper.

How your model is optimized | Know your Optimization

The answer to: “Why is my model running forever?” or the classic: “I think it might have converged?”

Many of the models that the average Data Scientist or ML engineer uses daily rely on numerical optimization methods under the hood. Studying how the optimization performs on different functions helps to gain a better understanding of how the process works.
The challenge we face on a daily basis is that someone gives us a model of how they think the world or their problem works. Now you, as a Data Scientist, have to find the optimal solution to that problem. For example, you look at an energy function and want to find the absolute, global minimum so that your tool works or your protein is stable. Maybe you have modeled user data and you want to find the ideal customer given all your input features (hopefully continuous ones).
Here is a compact summary of how an optimizer works under the hood:

Benchmark Functions – the curved, the bent and the asymmetric

Before we meet our optimizers we have to get acquainted with the functions that we intend to optimize. My three personal favorites, sketched in Python right after the list, are:

  1. Mr. Nice-Guy is the ellipsoid function. It is nicely behaved overall, convex, and has a single global minimum. It is defined as f_1(x)=\sum_{i=1}^d\alpha^{\frac{i-1}{d-1}}x_i^2.
  2. The Rosenbrock, also known as the banana function. It is not as nice as our f_1: its valley of death can be a challenge for optimization algorithms. It is defined as f_2(x)=(1-x_1)^2+100(x_2-x_1^2)^2.
  3. The twisted one is our attractive-sector function. It is the most unusual of the selection, as it is non-convex and asymmetric around the minimum. It is defined as f_3(x)=\sum_{i=1}^d h(x_i)^2+100\,h(-x_i)^2 where h(x)=\frac{\log(1+\exp(qx))}{q}.
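
Following the definitions above, here is a small Python sketch of the three functions for d=2 (the constants alpha and q are free parameters; the values below are chosen purely for illustration):

import numpy as np

def ellipsoid(x, alpha=1000):
    """f_1: convex and well behaved; alpha controls the conditioning."""
    d = len(x)
    i = np.arange(d)
    return np.sum(alpha ** (i / (d - 1)) * x ** 2)

def rosenbrock(x):
    """f_2: the banana function with its curved valley."""
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def attractive_sector(x, q=1e4):
    """f_3: non-convex and asymmetric around the minimum."""
    h = np.logaddexp(0.0, q * x) / q        # numerically stable h(x)
    h_neg = np.logaddexp(0.0, -q * x) / q   # h(-x)
    return np.sum(h ** 2 + 100 * h_neg ** 2)

print(ellipsoid(np.zeros(2)), rosenbrock(np.ones(2)), attractive_sector(np.zeros(2)))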

This is a comprehensible selection of functions, and there exists a multitude of other benchmark functions to run your optimizer across, from the depths of the Styblinski-Tang function to the beautiful peaks of Rastrigin. Some of them come ready-made and implemented, as in [1].

As you might have noticed the functions work on R^n and we provide an input vector \textbf{x}\in R^n. We will only be looking at two dimensions R^2 and three dimensions R^3 to make the display easier.

Now that we have set up a playground, we can load up our tool of choice: scipy with its optimization library. Our goal is to capture how the optimizer finds the minimum on a benchmark function given different starting points. We simply pick a few optimizers, such as the conjugate-gradient method “CG” or the more intricate quasi-Newton method BFGS. It goes like this:

from scipy.optimize import minimize

# import benchmark function with its derivatives
from scipy.optimize import rosen
from scipy.optimize import rosen_der
from scipy.optimize import rosen_hess

import numpy as np

def banana(x, y):
    """
    A nice 2D way to look at Rosenbrock.
    """
    return (1 - x)**2 + (100 * (y - (x**2))**2)

# we select some interesting optimizers:
OPTIMIZERS = ['CG', 'BFGS', 'Newton-CG', 'dogleg']
FUNCTIONS = {
    'rosen': [rosen, rosen_der, rosen_hess, banana],
    # feel free to add the ellipsoid or other benchmark functions here, with their derivatives
}

# bounds of the search window and the iteration schedule
x_min, x_max = -5, 5
y_min, y_max = -5, 5

start_iter = 1
max_iter = 50
step = 1

# ten random 2D starting points inside the window
start_values = np.random.uniform(low=x_min, high=x_max, size=(10, 2))

def optimize_funct(fname, funct, x0, derivative, hess, optimizer, iteration):
  """
  Just a wrapper around the scipy minimize function.
  """
  fitted_min = minimize(funct, x0=x0, method=optimizer,
                        jac=derivative, hess=hess,
                        options={'maxiter': iteration})
  return fitted_min

# run actual optimization
fitted_optimizers = {}
for opt in OPTIMIZERS:
  fitted_optimizers[opt] = {}
  for name, functs in FUNCTIONS.items():
    fitted_optimizers[opt][name] = []
    f = functs[0]
    fd = functs[1]
    fdd = functs[2]
    f_2d = functs[3]
    for vals in start_values:
      computed_values = []
      x, y = vals
      z = f_2d(x,y)
      # include start values before optimization
      computed_values.append(np.array([x,y,z]))
      for i in range(start_iter, max_iter, step):
        out = optimize_funct(fname=name, funct=f, x0=vals, derivative=fd, hess=fdd,
                            optimizer=opt, iteration=i)
        # only save the output values (stored under x)
        x, y = out.x
        z = f_2d(x, y)
        computed_values.append(np.array([x, y, z]))
      fitted_optimizers[opt][name].append(np.array(computed_values))

Not all optimizers are created equal. Some require the first derivative and some the second derivative of our objective functions, i.e. the Jacobian (here simply the gradient) and the Hessian. For an extra brief recap of what those are and how they work, you can look here.
One thing to keep in mind is that we start at different starting positions and let the optimizer roam freely; we are not doing constrained optimization just yet.

Fig. 2 – Contour plot of the ellipsoid function from different starting points using the CG optimizer. Each step is a “+”. The optimizer finds the global minimum after a few steps.
Fig. 3 – Contour plot of the Rosenbrock (banana) function. The gradient-based search first walks in the right direction (see black “+”) but then looks for the global minimum to no avail: the valley it traverses is already low in function value, so the gradient gives little information on where to look next.
Fig. 4 – Contour plot of the attractive-sector function with BFGS optimization. Given three starting points (three colors), each step is depicted by a black cross. The minimum is quickly discovered, and we see how a first-order line-search method (CG) behaves differently from BFGS, which belongs to the family of quasi-Newton methods.

Personally, it helped me a lot to visualize how the optimization procedures move through the different function spaces. Here you can see how CG and BFGS perform. CG is scipy’s nonlinear conjugate-gradient method: a plain first-order, classical line-search procedure. BFGS, on the other hand, uses approximated higher-order information and therefore classifies as a quasi-Newton method.

A quasi-Newton method approximates higher-order information, such as the Hessian matrix (the second-order derivatives), and improves this approximation with each step taken. [2]
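To make “approximates higher-order information” concrete, here is a sketch of the textbook BFGS update of the inverse-Hessian approximation (see e.g. [3]; this is the formula itself, not scipy’s internal implementation):

import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse Hessian approximation H.

    s = x_new - x_old (the step taken), y = grad_new - grad_old.
    The curvature condition y @ s > 0 must hold for H to stay positive definite.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)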

Line-Searching | Backtrack Everything.

One method of optimization is simply following a line along the underlying function (see Fig. 2 and Fig. 3). We keep some information around through backtracking, but that is about it. In order to optimize we may utilize first-derivative information of the function. An intuitive formulation of line search with backtracking is the following (a minimal code sketch comes right after the list):

  1. Compute gradient at your point
  2. Compute the step based on your gradient and step-size
  3. Take a step in the optimizing direction
  4. Adjust the stepsize by a previously defined factor e.g. \alpha
  5. Repeat until the minimum is found or the difference in value between your steps is very (very) small – e.g. 10^{-8}

Gradients give us information on how the slope of a function behaves. We take the gradient as the vector of first-order partial derivatives of the function, defined as \nabla f(x)=\left[\frac{\partial f}{\partial x_1},\ldots,\frac{\partial f}{\partial x_n}\right]^T.
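A minimal sketch of these five steps, run on the Rosenbrock function defined earlier (the step-size constants are illustrative, scipy’s line searches are considerably more refined, and plain gradient descent is famously slow on this function, so the result is only approximate):

import numpy as np
from scipy.optimize import rosen, rosen_der

def gradient_descent_backtracking(x0, tol=1e-8, max_iter=10000):
    """Plain gradient descent with a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = rosen_der(x)                        # 1. gradient at the current point
        step = 1.0                              # 2. start with a full step
        while rosen(x - step * g) > rosen(x) - 1e-4 * step * g @ g:
            step *= 0.5                         # 4. shrink the step (backtracking)
        x_new = x - step * g                    # 3. take the step
        if abs(rosen(x) - rosen(x_new)) < tol:  # 5. stop once the improvement is tiny
            return x_new
        x = x_new
    return x

print(gradient_descent_backtracking([-1.5, 2.0]))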

Trust-Region Optimizers | Thinking in Circles

Instead of just looking straight ahead and straight back, we can also look around us in a 360-degree fashion. We use our function information, like the Jacobian and the Hessian, to make an optimal step within a region we trust. For this to work we consider the quadratic model function m_k(p)=f_k+g_k^Tp+\frac{1}{2}p^TB_kp, where g_k is the gradient of f at the current point and B_k the (approximate) Hessian.
During the optimization we solve for the step p and adjust the radius of the region around us appropriately. This looks like:

Fig. 5 – Trust-region optimization on the contour of the attractive-sector function. Each step is depicted as a “+”, and the circle around each step shows the region from which the next step is chosen.

An intuitive formulation of the procedure can look like this (see the Appendix for a formal representation, and the short scipy sketch after the list):

  1. Compute Hessian and Gradient of your function
  2. Choose a point in a radius around your position that is optimal
  3. Take a step given the computed information
  4. Adjust the size of the radius based on how much or how little you have improved
  5. Repeat until convergence or tolerance is reached
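Rather than implementing the radius bookkeeping by hand, we can let scipy do it. This sketch runs a trust-region Newton-CG method on the Rosenbrock function and records each accepted step through a callback (any of scipy’s trust-* methods that accept a Hessian would do):

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

steps = []    # positions visited by the optimizer

result = minimize(rosen, x0=np.array([-1.5, 2.0]), method="trust-ncg",
                  jac=rosen_der, hess=rosen_hess,
                  callback=lambda xk: steps.append(xk.copy()))

print(result.x)                     # should be close to the minimum at (1, 1)
print(len(steps), "steps recorded")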

Of course these short steps don’t quite capture the complexity: you need a few more parameters and a bit more math to find the right point within your radius and to adjust the step and the radius. If that interests you, see Alg. 1 and 2 in the Appendix.

We can observe an algorithm in practice and look at how the radius changes over time:

Fig. 6 – Magnitude of the function values (y-axis) over the steps taken (x-axis). The radius \Delta changes with the improvement of the function values.

The radius is adjusted based on how well each step actually improves the function compared to what the quadratic model predicted: steps that agree well with the model allow the radius to grow, while poor steps shrink it (see Fig. 5 and 6). The first steps of the optimization on the attractive-sector function in Fig. 6 show this adjustment clearly.

Conclusion | The Rabbit Hole

We have seen how two families of optimizers work: line-search optimizers and trust-region optimizers. This is only the entrance of the rabbit hole. With each function and problem we look at, we can discover something new. We can analyze how our Jacobian or Hessian matrix has to behave to make the optimizer do its work, and we have to look at the shape and properties of the functions themselves.
Obviously, a vast body of proofs and theorems forms the mathematical backbone of what we have just looked at.
If you want to dive deep and get your hands on the raw mathematics, I recommend working through the book Numerical Optimization (2nd ed.) by J. Nocedal and S. J. Wright [3].


Acknowledgements

Much of the work shown here was done while attending the Numerical Optimization course at KU – DIKU. Without the guidance of the course officials this would not have been possible.

References

[1] MathWorks. Test functions for global optimization algorithms.
[2] C. Geiger and C. Kanzow. Quasi-Newton-Verfahren. Springer, 1999.
[3] J. Nocedal and S. J. Wright. Numerical Optimization (2nd ed.). Springer, 2006.

Appendix

Trust Region Algorithm

Algorithm for the trust-region optimization in Fig. 5.
Algorithm to solve for p – solving for \lambda is a separate subproblem.