How your model is optimized | Know your Optimization

The answer to: “Why is my model running forever?” or the classic: “I think it might have converged?”

The driver behind a lot of models that the average Data Scientist or ML-engineer uses daily, relies on numerical optimization methods. Studying the optimization and performance on different functions helps to gain a better understanding of how the process works.
The challenge we face on a daily basis is that someone gives us a model of how they think the world or their problem works. Now, you as a Data Scientist have to find the optimal solution to the problem. For example you look at an energy-function and want to find the absolute, global minimum for your tool to work or your protein to be stable. Maybe you have modeled user-data and you want to find the ideal customer given all your input-features — hopefully continuous ones.
Here is a comprehensive summary of how an optimizer works under the hood:

Benchmark Functions – the curved, the bent and the asymmetric

Before we meet our optimizers we have to get acquainted with the functions that we intend to optimize. My three personal favorites are:

  1. Mr. Nice-Guy is the ellipsoid function. It is overall nicely behaved, convex and has a global minimum.
    it is defined as f_1(x)=\sum_{i=1}^d\alpha\frac{i-1}{d-1}(x_i)^2
  2. The Rosenbrock or also banana-function. It is not as nice as our f_1 – through its valley of-death it can be a challenge to optimization algorithms. It is defined as f_2(x)=(1-x_1)^2+100*(x_2-x_1^2)^2.
  3. The twisted-one is our attractive-sector function. It is the most unusual of the selection as it is non-convex and asymmetric around the minimum. It is defined as f_3(x)=\sum_{i=1}^d h(x_i)^2+100(-x_i)^2 where h(x)=\frac{\log(1+\exp(qx))}{q}.

This is comprehensible selection of functions and there exists a multitude of available benchmark-functions to run your optimizer across. From the depths of the Styblinski-Tang function to the beautiful peeks of Rastrigin. Some of which are ready made and implemented like in [1].

As you might have noticed the functions work on R^n and we provide an input vector \textbf{x}\in R^n. We will only be looking at two dimensions R^2 and three dimensions R^3 to make the display easier.

Now that we have set-up a playground we can just load up our tool of choice like scipy with its optimization library. Our goal is to capture how the optimizer finds the minimum on the benchmark function given different starting points. We just pick and choose any optimizers like simple gradient descent “CG” or the more intricate BFGS. It goes like this:

from scipy.optimize import minimize

# import benchmark function with its derivatives
from scipy.optimize import rosen
from scipy.optimize import rosen_der
from scipy.optimize import rosen_hess

import numpy as np

def banana(x, y):
a nice 2D way to look at Rosenbrock 
    return (1 - x)**2 + (100 * (y - (x**2))**2)

# we select some interesting optimizers:
OPTIMIZERS = ['CG', 'BFGS', 'Newton-CG', 'dogleg']
FUNCTIONS = {'rosen': [rosen, rosen_der, rosen_hess, banana]
# feel free to add or implement ellipsoid or other benchmark functions here with their derivatives

start_values = np.random.uniform(low=x_min, high=x_max, size=(10, 2))

start_iter = 1
max_iter = 50
step = 1
x_min, x_max = -5, 5
y_min, y_max = -5, 5

def optimize_funct(fname, funct, x0, derivative, hess, optimizer, iteration):
Just a wrapper around the scipy minimize function
  fitted_min =  minimize(funct, x0=x0, method=optimizer, 
                         jac=derivative, hess=hess,
  return fitted_min

# run actual optimization
fitted_optimizers = {}
for opt in OPTIMIZERS:
  fitted_optimizers[opt] = {}
  for name, functs in FUNCTIONS.items():
    fitted_optimizers[opt][name] = []
    f = functs[0]
    fd = functs[1]
    fdd = functs[2]
    f_2d = functs[3]
    for vals in start_values:
      computed_values = []
      x, y = vals
      z = f_2d(x,y)
      # include start values before optimization
      for i in range(start_iter, max_iter, step):
        out = optimize_funct(fname=name, funct=f, x0=vals, derivative=fd, hess=fdd,
                            optimizer=opt, iteration=i)
        # only save the output values (stored under x)
        x, y = out.x
        z = f_2d(x, y)
        computed_values.append(np.array([x, y, z]))

Not all optimizers are created equal. Some require the first and some the second derivative for our optimization functions aka the Jakobian and the Hessian . For an extra brief recap of what those are and how they work you can look here.
One thing to keep in mind is, that we start at different starting positions and let the optimizer roam freely – we are not doing constraint optimization just yet.

Fig. 2 – Contour plot of the Ellipsoid function from different starting points using standard gradient descend. Each step is a “+”. The optimizer finds the global minimum after a few steps.
Fig. 3 – Contour plot of the Rosenbrock- (Banana-) function. Gradient descent first walks into the right direction (see black “+”), but it looks for the global minimum to no avail. The valley it traverses is already low in function-value and information on where to look next is not as informative.
Fig 4. Attr. Sector contour plot of the function. BFGS optimization on the function space. Given three starting points (three colors) each step is depicted by a black cross. The minimum is quickly discovered and we see how a linear method (CG) behaves differently compared to BFGS which is from the family of quasi-newton methods.

Personally it helped me a lot to visualize how the optimization procedures move through the different function-spaces. Here you can see how CG and BFGS perform. Now CG is scipy’s implementation of gradient descent . It is the plain and simple optimization procedure and a classical line-search method. BFGS on the other hand uses approximated higher order information. BFGS therefore classifies as a quasi-Newton method.

A quasi-Newton method approximates higher order information like the Hessian Matrix (second order derivatives) and makes improvements per each step taken. [2]

Line-Searching | Backtrack Everything.

One method of optimization is simply following a line given the underlying function (see Fig. 2 and Fig. 3). We keep some information through backtracking, but that is about it. In order to optimize we may utilize first derivative information of the function. An intuitive formulation of line search optimization with backtracking is:

  1. Compute gradient at your point
  2. Compute the step based on your gradient and step-size
  3. Take a step in the optimizing direction
  4. Adjust the stepsize by a previously defined factor e.g. \alpha
  5. repeat until minimum is found or the difference in value between your steps is very (very) small – e.g. 10^{-8}
Gradients give us information on how the slope of a function behaves. We take the gradient as the first-order-derivative or partial derivative of a function. It is depicted as a vector and defined as \nabla f(x)=[\frac{\delta f}{\delta x_1}\ldots\frac{\delta f}{\delta x_n}]^T. 

Trust-Region Optimizers | Thinking in Circles

Instead of just looking straight ahead and straight back we can also look around us in a 360 degree fashion. We look at our function-information like the Jacobian and Hessian and make an optimal step in the right direction. In order for this to work we consider the quadrativ model function m_k(p)=f_k+g_k^Tp+\frac{1}{2}(p^TBp) where g is the gradient of f and B the Hessian.
When we do the optimization we optimize for p and adjust the radius that is around us appropriately. This looks like:

Fig. 5 Trust-Region Optimization on the contour of the attr.-sector function. Each step is depicted as “+” and around each step is the region from which the next step is chosen as a circle.

An intuitive formulation of the procedure can look like this
(see the Appendix for a formal representation):

  1. Compute Hessian and Gradient of your function
  2. Choose a point in a radius around your position that is optimal
  3. Take a step given the computed information
  4. Adjust the size of the radius given on how much or how little you have improved
  5. Repeat until convergence or tolerance is reached

Of course these short steps don’t quite capture the complexity, you need more parameters and a bit more math to find the right point in your radius, adjust the step and the radius if that is interesting to you see Alg. 1 and 2 in the Appendix.

We can observe an algorithm in practice and look at how the radius changes over time:

Fig. 6 – magnitude changes of function values (y axis) over steps taken (x axis). The radius \Delta changes with the improvement of the function-values.

We see that if the function values improve over time then our radius gets smaller (see Fig. 5 and 6) whereas if the values are bad initially we increase our radius. See the first steps of the optimization on the attr. sector function in Fig. 6.

Conclusion | The Rabbit Hole

We have seen how two families of optimizers work: the straight-lined linear optimizer and the trust-region optimizer. This is only the entrance of the rabbit hole. With each function and problem we look at we can discover something new. We can analyze how our Jacobian or Hessian matrix have to behave to make our optimizer do its work. We have to look at the shape and properties of our functions.
Obviously there exists a vast number of available proofs and theorems that are the mathematical backbone of what we have just looked at.
If you want to dive deep and get your hands on the raw mathematics I recommend going through the book Numerical Optimization (2nd ed.) by J. Nocedal and S. J. Wright [3].


Much of the work that I have shown here I had done in attending the Numerical Optimization course at KU – DIKU. Without the guidance of the course-officials this would not have been possible.


[1] MathWorks Test functions for global optimization algorithms
[2] C. Geiger and C. Kanzow . Quasi-Newton-Verfahren Springer 1999
[3] J. Nocedal and S. J. Wright Numerical Optimization (2nd ed.) Springer 2006


Trust Region Algorithm

Algorithm for the region of trust optimization in Fig. 5
Alg. to solve for p – solving for lambda is another subproblem.