Optimizing Hyperparameters the Right Way

Efficiently exploring the hyperparameter search space through Bayesian Optimization with skopt in Python. TL;DR: my hyperparameters are always better than yours.

Explore vast canyons of the problem space efficiently – Photo by Fineas Anton on Unsplash

In this post we will build a machine learning pipeline with multiple candidate models and use the power of Bayesian Optimization to arrive at an optimal configuration for all our parameters. All we need is sklearn's Pipeline and skopt.
You can use your favorite ML models, as long as they have a scikit-learn wrapper (looking at you, XGBoost and NGBoost).

About Hyperparameters

The critical point in finding the best model for a problem is not just the choice of model. What we need is to find the parameters that make the model work optimally on the given dataset. This is called hyperparameter search, or tuning.

For example, say we would like to use a Random Forest in practice; its documentation states:

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, ...

Each of those parameters can be explored. This includes every possible number of estimators ( n_estimators ) in the forest from 1 to 10,000, whether to split using “gini” or “entropy”, the maximum depth of the trees, and many, many more options. Each of these parameters can influence your model’s performance, and worst of all, most of the time you do not know the right configuration when you are starting out with a new problem set.

The Jack-Hammer aka Grid-Search

The brute-force way to find the optimal configuration is to perform a grid search, for example using sklearn’s GridSearchCV. This means that you try out all possible combinations of parameters on your model.
On the bright side, you might find the desired values. The problem is that the runtime is horrible, and overall grid search does not scale well: every new parameter you add gets crossed with all of the previously specified ones.
On top of that, the search is completely uninformed, even though we have the intuition that some regions of the parameter space are more promising than others – right?
We do not need to test every combination – especially not those we already know are far off.
A step in the right direction is randomized search, e.g. RandomizedSearchCV, which samples parameter settings at random instead of exhaustively enumerating them – cheaper, but still blind to the results of previous trials. A sketch of both is shown below.
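A minimal sketch of both approaches on the RandomForestClassifier from above – the toy data and the parameter ranges here are made up purely for illustration:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=21)  # toy data

# Grid search: every combination is evaluated (3 * 2 * 3 = 18 fits per CV fold)
grid = GridSearchCV(
    RandomForestClassifier(random_state=21),
    param_grid={
        "n_estimators": [10, 100, 1000],
        "criterion": ["gini", "entropy"],
        "max_depth": [None, 5, 10],
    },
    cv=5,
)

# Randomized search: only n_iter configurations are sampled from the distributions
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=21),
    param_distributions={
        "n_estimators": randint(1, 1000),
        "criterion": ["gini", "entropy"],
        "max_depth": [None, 5, 10, 20],
    },
    n_iter=20,
    cv=5,
    random_state=21,
)

grid.fit(X, y)
rand.fit(X, y)
print(grid.best_params_)
print(rand.best_params_)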

Better Bayesian Search

Our tool of choice is BayesSearchCV. This approach uses stepwise Bayesian Optimization to explore the most promising hyperparameters in the problem space.
Very briefly, Bayesian Optimization finds the minimum of an objective function over large problem spaces and is well suited to continuous values. To do this it uses Gaussian Process regression on the objective function under the hood. A thorough mathematical introduction can be found in [2].
The Bayesian Optimization approach lets us specify a much larger range of possible values, since over time the search automatically concentrates on the most promising regions and discards the less promising ones.
Plain grid search would need ages to blindly explore all possible values.
Since we move much more effectively, we can allow for a much larger playing field.
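To build some intuition before the real example, here is a tiny toy sketch – the one-dimensional objective and its bounds are invented purely for illustration. gp_minimize evaluates the black-box function, fits a Gaussian Process surrogate to the observations made so far, and uses it to pick the next promising point.

from skopt import gp_minimize

# Toy black-box objective; the optimizer only ever sees its return values
def f(x):
    return (x[0] - 2.0) ** 2 + 0.5 * x[0]

toy_res = gp_minimize(f, [(-5.0, 5.0)], n_calls=25, random_state=21)
print(toy_res.x, toy_res.fun)  # best input found and its objective value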
Let’s look at an example.

The Problem Set

Today we use the diabetes dataset from sklearn for ease of use.
This saves us the trouble of loading and cleaning the data, and the features come already neatly encoded.
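For completeness, a short sketch of how the diabetes_df used below could be assembled (assuming a scikit-learn version that supports as_frame):

from sklearn.datasets import load_diabetes

# Load the dataset directly as a pandas DataFrame; .frame holds the ten
# feature columns plus the 'target' column
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.frame
print(diabetes_df.shape)  # (442, 11)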

Fig. 1 – Overview of the dataset.

We have columns for age, sex, BMI, blood pressure and six blood serum measurements (the s-columns), all numerically encoded. Our target value is a quantitative measure of disease progression.
We can see some interactions among the s-columns, indicating correlation between the blood serum values.

Fig. 2 – Excerpt from a pairplot examining parameter interactions over all data.

To build our pipeline, we first split our dataset into training and test sets with an 80:20 split.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    diabetes_df.drop(columns="target"), diabetes_df.target,
    test_size=0.2, random_state=21)

Build a Pipeline

We need three elements to build a pipeline: (1) the models to be optimized, (2) the sklearn Pipeline object, and (3) the skopt optimization procedure.

First, we choose two boosting models: an AdaBoost and a Gradient Boosting regressor, and for each we define a search space over its crucial hyperparameters. Any other regressor from the depths of the sklearn library would do, but boosting might win you the next hackathon (…it’s 2015, isn’t it?).
The search space is a dictionary with key-value pairs of the form { 'model__parameter' : skopt.space object}. For each parameter we define a space from the skopt library with the range that we want. Categorical values are included by passing them as a list of strings (see Categorical below):

from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from skopt.space import Categorical, Integer, Real

# Search space for the AdaBoost regressor
ada_search = {
    'model': [AdaBoostRegressor()],
    'model__learning_rate': Real(0.005, 0.9, prior="log-uniform"),
    'model__n_estimators': Integer(1, 1000),
    'model__loss': Categorical(['linear', 'square', 'exponential'])
}

# Search space for the Gradient Boosting regressor
# (newer sklearn versions rename 'ls'/'lad' to 'squared_error'/'absolute_error')
gb_search = {
    'model': [GradientBoostingRegressor()],
    'model__learning_rate': Real(0.005, 0.9, prior="log-uniform"),
    'model__n_estimators': Integer(1, 1000),
    'model__loss': Categorical(['ls', 'lad', 'quantile'])
}

Second, we build the Pipeline object itself. Its single 'model' step is just a placeholder that the search will swap out, which is where both candidates (AdaBoost and Gradient Boosting) come together for selection:

from sklearn.pipeline import Pipeline

# Single-step pipeline; the 'model' step is swapped out during the search
pipe = Pipeline([
    ('model', GradientBoostingRegressor())
])

Third, we optimize over our search spaces by invoking BayesSearchCV. We also specify how often the optimizer should sample each search space – in our case 100 invocations each. Then we fit the pipeline with a simple .fit() call:

from skopt import BayesSearchCV

# Each (search_space, n_iter) tuple gets its own budget of 100 evaluations
opt = BayesSearchCV(
    pipe,
    [(ada_search, 100), (gb_search, 100)],
    cv=5
)

opt.fit(X_train, y_train)

After the fitting is done, we can ask for the parameters that were found to be optimal. This includes using the score function on the unseen test data.
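For example, the following short sketch (using the standard sklearn search-CV attributes that BayesSearchCV implements) produces the kind of output shown in Fig. 3:

print("best params:", opt.best_params_)          # winning model and its hyperparameters
print("validation score:", opt.best_score_)      # best mean cross-validated score
print("test score:", opt.score(X_test, y_test))  # score on the unseen test data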

Fig. 3 – Evaluating the fitted pipeline

We can see both the validation and the test score. The model with the best results is the AdaBoost model with a linear loss and 259 estimators at a learning rate of 0.064. Neat.

From the output we can also see that the optimizer uses a Gaussian Process (GP) for the optimization under the hood.

Fig. 4 – Insights into the optimizer results with a peek into the model.

For illustrative purposes we can also call GP minimization directly – we’ll use the AdaBoost regressor for that – focusing only on a numerical hyperparameter space.

ada = AdaBoostRegressor(random_state=21)

# numerical space
space  = [Real(0.005, 0.9, "log-uniform", name='learning_rate'),
          Integer(1, 1000, name="n_estimators")]

The major change here is that we have to define an objective function to optimize over. To pass the sampled points of the parameter space into the model, we use skopt's use_named_args decorator.

import numpy as np
from sklearn.model_selection import cross_val_score
from skopt.utils import use_named_args

@use_named_args(space)
def objective(**params):
    ada.set_params(**params)
    # negate the (negative) MAE so that gp_minimize can minimize it
    return -np.mean(cross_val_score(ada, X_train, y_train, cv=5, n_jobs=-1,
                                    scoring="neg_mean_absolute_error"))

Our objective is therefore the mean absolute error from the cross-validated fit, sign-flipped since sklearn reports it as neg_mean_absolute_error. Other scoring functions might suit your problem better.
We then use GP minimization to find the optimal parameters for our regressor.

from skopt import gp_minimize

# store the result so we can inspect and plot it afterwards
res = gp_minimize(objective, space, n_calls=100, random_state=21)

Visualize the problem space – post-optimization

Using GP optimization directly also allows us to plot the convergence of the minimization process.
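A minimal sketch, assuming the gp_minimize result from above was stored in res:

import matplotlib.pyplot as plt
from skopt.plots import plot_convergence

# Best objective value found so far, plotted against the number of calls
plot_convergence(res)
plt.show()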

Fig. 5 – Convergence of GP minimization while finding the optimal hyperparameters of the AdaBoost regressor with respect to the target column in the dataset.

We can see that the minimum of the function value has already converged after around 40 iterations.
The last excellent feature is visualizing the explored problem space. Since we used only numerical inputs at this point, we can evaluate the problem space by plotting it with skopt, like so:
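A sketch of the call behind a figure like Fig. 6, again assuming the res result from above – skopt's plot_evaluations draws histograms of the sampled values on the diagonal and scatterplots of all evaluated points:

import matplotlib.pyplot as plt
from skopt.plots import plot_evaluations

# Histograms of sampled hyperparameter values plus scatterplots of the
# evaluated points across the search space
plot_evaluations(res)
plt.show()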

Fig. 6 – Visualizing the invoked learning_rate and n_estimator values over the problem space as histograms. The minimum is marked by the red star in the scatterplot.

From the figure above we can see that a lot of exploration was done on the lower end of our learning-rate spectrum, and that the most frequently tried estimator counts were around 200 and above 800.
The scatterplot paints a picture of how difficult the problem space is that we are trying to optimize – with the minimum marked by a red star. Though we have reached convergence, not many evaluations lie in the region around the minimum, so increasing the number of evaluations could be considered.

Now that we have the best model given our parameter space, we can carry that model forward and go into its specific analysis; a sketch of this is shown below.
As you have seen, skopt is a library that offers a lot in the realm of optimization, and I encourage you to use it in your daily ML engineering.
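The winning configuration reported above (linear loss, 259 estimators, learning rate 0.064) could, for example, be refit as a standalone model – equivalently, opt.best_estimator_ already holds the refitted winning pipeline:

from sklearn.ensemble import AdaBoostRegressor

# Refit the best configuration found by the search on the full training set
best_ada = AdaBoostRegressor(
    loss="linear",
    n_estimators=259,
    learning_rate=0.064,
    random_state=21,
)
best_ada.fit(X_train, y_train)
print(best_ada.score(X_test, y_test))  # R^2 on the held-out test set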

Happy Exploring!

The complete code can be found in a Notebook here.

References

  1. Marc Claesen, Bart De Moor. Hyperparameter Search in Machine Learning. 2015 on arXiv.
  2. Peter I. Frazier. A Tutorial on Bayesian Optimization. 2018 on arXiv.
  3. scikit-optimize contributors (BSD License). Scikit-learn hyperparameter search wrapper.