Nonlinear Optimization

Table of Contents

Readings and References:


Introduction

Many economic and econometric problems can be formulated as optimization (minimization and maximization) problems with or without constraints. In most cases, simple equality constraints are substituted into the objective function so that the problems are essentially the unconstrained optimization. Inequality constraints are possible and will make the problem more difficult to solve.

Scalar-Valued Function of One or Two Variables

Taylor Approximation: A Review

Given a n-variable scalar-valued function f: Rn -> R, the 1xn gradient vector of f (1st derivatives) evaluated at x0 is g(x0), and the corresponding nxn Hessian matrix of f (2nd derivatives) is H(x0). The first-order (linear) Taylor approximation of f at x0 is

f(x) = f(x0) + g(x0) (x-x0)

Similarly, the second-order (quadratic) Taylor approximation of f at x0 is

f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0)

Unconstrained Optimization

A typical unconstrained optimization problem is solved as follows:

max f(x) min f(x)
x0 is the maximum: x0 is the minimum:
f(x0) ³ f(x) for all x
g(x0) = 0
H(x0) is negative definite
f(x0) £ f(x) for all x
g(x0) = 0
H(x0) is positive definite

From the first-order approximation of f at x0, we have

f(x) = f(x0) + g(x0) (x-x0), or

f(x0+Dx) - f(x0) = g(x0) Dx,

where Dx = x-x0. Dx is called the step of optimization.

If x0 is not the optimum (maximum or minimum), then the function value may be improved by taking the step Dx in accordance with the direction of gradient g(x0).

max f(x) min f(x)
find Dx such that f(x0+Dx) - f(x0) > 0: find Dx such that f(x0+Dx) - f(x0) < 0:
from the 1st-order approximation,
g(x0) Dx > 0, i.e.
if g(x0) > 0, Dx > 0;
if g(x0) < 0, Dx < 0
from the 1st-order approximation,
g(x0) Dx < 0, i.e.
if g(x0) > 0, Dx < 0;
if g(x0) < 0, Dx > 0

Cauchy Step

Setting Dx = g(x0)' for maximization or Dx = -g(x0)' for minimization is the so-called Cauchy Step.

Newton Step

If f(x) is approximated at x0 by the second order, f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0). Suppose x is the optimum (maximum or minimum), from the first-order condition of optimization: f(x)/x = 0. That is,

g(x0)+Dx' H(x0) = 0, or Dx = -H(x0)-1 g(x0)'.

Here the optimal step = -H(x0)-1 g(x0)' is the so-called Newton Step. Note that H(x0) is a negative definite (positive definite) matrix when x0 is near the maximum (minimum) x.

Numerical Computation Method

Step Size and Direction

Finding the solution x is essentially the task of searching for the optimal step Dx so that the first and second order conditions of function optimization are satisfied. Let Dx = s d: s is the step size (>0), and d is the step direction. Various methods of optimization are available depending on the choice of s and d. In general, the direction d is determined according to the gradient vector g':

max f(x)min f(x)
Dx = s d D = s d
s > 0
d = M g'
M is positive definite
s > 0
d = -M g'
M is positive definite

Cauchy type direction is obtained by setting M = I, the identity matrix. Newton type direction is obtained by setting M = -H-1. Note that the Hessian H is negative definite for maximization and positive definite for minimization.

In searching for an optimum, the following typical iterative procedure is used:

for the i+1-th iteration, xi+1 = xi + si di, where di = ± Mi gi' with Mi being positive definite.

Given the initial value x0 and step direction d0, determine a step size s0. Then compute x1 = x0 + s0 d0. Here d0 is method dependent. The iteration looks like this:

specify         compute             check      (yes)
x0 ---> s0 ---> x1 = x0 + s0 d0 ---> convergence ---> stop
      |              |                             | (no)
      |              |                             | 
      d0           s1 <------------------- compute d1
How a step size, s0 or s1 above, is determined?
Given x0 and d0, find s>0 so that f(x0+s d0) is maximized (or minimized). This is an one-dimensional line search problem.
What is the optimal size of a step (s>0)?
From the second-order approximation:

f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0) or

f(x0+s d) = f(x0) + g(x0) s d + ½ s2 d' H(x0) d.

Maximize (or minimize) the above function with respect to s, the first-order condition is

g(x0) d + s d' H(x0) d = 0.

Then the optimal step size is s = -(g(x0) d)/(d' H(x0) d). For Cauchy type direction, the optimal step size is

s = -(g g')/(g H g').

For Newton type direction, the optimal step size is exactly 1.

In practice, the optimal step size is rarely useful. Here a simple bi-directional linear search method is considered instead:

  1. Given the step direction d and a specified step size s, that is x = x0 + s d. If the function value f(x) improves over the previous f(x0), the step size in use could be enlarged:

    Repeat s = 1.1 s until no further improvement on f(x).

  2. On the other hand, f(x) could be worse than f(x0). Then the step size in use should be contracted:

    Repeat s = s/2 until an improvement of f(x) is found. If f(x) can not be improved during this contraction process, f(x) = f(x0) is already an optimum or a saddle point.

(i) and (ii) may be repeated and iterated. Alternative methods are available for searching the proper step size. For example, see Berndt, Hall, Hall, and Hausman [1974].

Optimization Methods

Since di = ± Mi gi', the positive definite matrix Mi is determined in terms of different optimization methods used:

Optimization Method Matrix M
Steepest ascent (descent) method I
BFGS method see (2) below
DFP method see (2) below
Newton-Raphson method -H-1
Greenstadt method (S(|wi| vi vi'))-1 or
S(|1/wi| vi vi'), see (1) below
Quadratic-hill climbing -(H-r I)-1, see (1) below
Modified quadratic-hill climbing -(H-r A)-1, see(1) below

Explanation Notes:
  1. Since d = ± M g' with M positive definite at optimum. However, for some iteration i, Mi may not be positive definite. Various methods have been designed to deal with non-positive definiteness of Mi. This includes but not limited to

    For the Greenstadt method, let wi and vi be the i-th eigenvalue and its corresponding eigenvector of a positive definite matrix. With all wi > 0, the matrix is S wi vi vi'. Its inverse is S (1/wi) vi vi' which is also positive definite. If some of wi is negative, use |wi| to replace wi.

    For the quadratic-hill climbing methods, r ³ 0 forces H - rI or H - rA to be negative definite (for maximization). Indeed, r = 0 if H is already negative definite, and r is greater than the largest eigenvalue of H otherwise. For the modified quadratic-hill climbing method, A is the transformation matrix for optimal search over an ellipsoidal region. The computation of Hessian and its eigenvalues and eigenvectors may be expensive.

  2. Newton method and its variations require the computation of Hessian matrix H for each iteration, a class of quasi-Newton methods using only the gradient vector are available:

    The idea is to approximate H-1 during iterations: let N be the matrix approximating H-1, Dxi = xi+1 - xi, and Dgi = gi+1 - gi, then the Broyden (rank one correction) method is to compute Ni+1 from Ni according to

    Ni+1= Ni +
    (Dxi-NiDgi')(Dxi-NiDgi')'

        (Dxi-NiDgi)'Dgi'

    The formula of Ni+1 for DFP (rank two correction) method is

    Ni+1= Ni +
    DxiDxi'

    Dxi'Dgi'
    -
    NiDgi'DgiNi

     DgiNiDgi'

    The formula of Ni+1 for BFGS (rank two correction) method is

    Ni+1= Ni
    æ
    ç
    è
    1 +
    DgiNiDgi'

     Dxi'Dgi'
    ö æ
    ÷ ç
    ø è
    DxiDxi'

    Dxi'Dgi'
    ö
    ÷
    ø
    -
    DxiDgiNi+NiDgi'Dxi'

         Dxi'Dgi'

Convergence Criteria

Since the numeric optimization methods are iterative in nature, certain convergence criteria are needed to stop the iterations when a solution is found. Given a tolerance level, for two iterations i and i+1,
  1. f(xi+1) ³ f(xi) for maximization, or
    f(xi+1) £ f(xi) for minimization;
    ||f(xi+1) - f(xi)|| --> 0.
  2. ||xi+1 - xi|| --> 0 and (1).
  3. ||g(xi)||, ||g(xi+1)|| --> 0 and (1), (2).
    Note: ||g(xi)|| = g(xi)H(xi)-1g(xi)', if H(xi) is evaluated.

Example 1: Solving Mathematical Function

Find four minima of the following function (Program):

f(x,y) = (x2+y-11)2 + (x+y2-7)2

Example 2: Estimating Probability Distributions

The characteristics of a random variable (e.g. mean and variance, etc.) may be evaluated through the joint probability density of a finite sample. This joint density function, or the likelihood function, is defined as the product of N independent density functions f(Xi,q) of data observations Xi (i=1,2,...,N) and a parameter vector q. That is,

Õi=1,2,...,N f(Xi,q),

or equivalently in log form:

ll(q) = åi=1,2,...,N ln f(Xi,q)

The problem is to maximize the log-likelihood function ll(q) so that the solution q characterizes the probability distribution of the random variable X under consideration. To find the q that maximizes ll(q) is the essence of maximum likelihood estimation. The corresponding variance-covariance matrix of q is derived from the information matrix (negatives of the expected values of the second derivatives) of the log-likelihood function as follows:

Var(q) = [-E(2ll/¶q¶q')]-1

Normal Distribution

The familiar example is the likelihood function derived from a normal probability distribution:

f(X,q) = 1/Ö(2ps2) exp [(X-m)2/(2s2)]

Where q = (m,s2) represents the distribution parameters. It is straightforward to show that the maximum likelihood solution is m = E(X) = 1/N åi=1,...,N Xi, the sample mean; and s2 = Var(X) = 1/N åi=1,...,N (Xi-m)2, the sample variance.

Log-Normal Distribution

Another example is based the log-normal distribution of X (or equivalently, normal distribution of ln(X)) defined as:

f(X,q) = 1/Ö(2ps2) (1/X) exp [(ln(X)-m)2/(2s2)]

With the solution m = 1/N åi=1,...,Nln(Xi) and s2 = 1/N åi=1,...,N(ln(Xi)-m)2, the corresponding mean and variance of X is E(X) = exp(m+s2/2) and Var(X) = exp(2m+s2)[exp(s2)-1], respectively. Many economic variables are described with a log-normal instead of a normal probability distribution.

Gamma Distribution

Of course, maximum likelihood estimation is not limited to models with normal or log-normal probability distribution. In many situations, the probability distribution of a random variable may be non-normal or unknown. For example, to estimate the gamma distribution of a nonnegative random variable X ³ 0, the distribution function is

f(X,q) = lr/G(r) e-lX X r-1

where q = (l, r) is the parameter vector with l > 0 and r > 0. The mean of X is r/l, and the variance is r/l2. Many familiar distributions, such as the exponential and Chi-square distributions, are special cases of the gamma distribution.

As with the normal distribution, the technique of maximum likelihood can be used to estimate the parameters of the gamma distribution. Sampling from N independent observations from the gamma distribution, the log-likelihood function is:

ll(q) = N [rln(l) - lnG(r)] - l åi=1,2,...,N Xi + (r-1) åi=1,2,...,N ln(Xi).

With the normal, log-normal, and gamma probability distributions, the characteristics of the random variable X may be described in terms of the estimated mean and variance for each probability distribution as follows:

Normal
Distribution
Log-Normal
Distribution
Gamma
Distribution
Mean
E(X)
mexp(m+s2/2)r/l
Variance
Var(X)
s2exp(2m+s2)[exp(s2)-1]r/l2
Where:m = 1/N åi=1,...,NXi
s2 = 1/N åi=1,...,N(Xi-m)2
m = 1/N åi=1,...,Nln(Xi)
s2 = 1/N åi=1,...,N(ln(Xi)-m)2

Based on 20 observations of a hypothetical data series INCOME of Greene [1999, Chapter 4, Table 4.1], estimate its mean and variance under the assumption of three probability distributions. First we need to define and represent the log-likelihood function for each of probability distributions. Then we estimate the parameters of each probability distribution by maximizing the corresponding log-likelihood function respectively (Program and Data).

Example 3: Mixture of Probability Functions

It is possible that a random variable is drawn from a mixture of probability distributions (two or more, same or different types of distributions). For simple exploration, consider X is distributed with a mixture of two normal distributions:

f1(X,m1,s1) = 1/Ö(2ps12) exp [(X-m1)2/2s12] and
f2(X,m2,s2) = 1/Ö(2ps22) exp [(X-m2)2/2s22]

Then the likelihood function is

f(X,q) = l f1(X,m1,s1) + (1-l) f2(X,m2,s2)

where l is the probability that an observation is drawn from the first distribution f1(X,m1,s1), and 1-l is the probability of that drawn from the second distribution. q = (m1,m2,s1,s2,l)' is the unknown parameter vector need to be estimated.

Continue from the previous example. Suppose each observation of the variable INCOME is drawn from one of two different normal distributions. Formulate and maximize the log-likelihood function for solving the parameters of two normal distributions (Program).

Example 4: Minimizing Sum-of-Squares Function

This is the example taken from Judge, et. al. [1988], Chapter 12, p. 512. Fit the following two production functions based on 30 data observations of labor L, capital K, and output Q given in Table 12.3:
  1. Cobb-Douglas Production Function
    ln(Q) = b1 + b2ln(L) + b3ln(K) + e

  2. CES Production Function
    ln(Q) = b1 + b4ln(b2Lb3 + (1-b2)Kb3) + e

The method of least squares estimation is to find b = (b1,b2,b3,b4)' so that the sum of squared errors S = e'e = Si=1,...,30ei2 is minimized (Program).

Constrained Optimization

In many economic and econometric applications, the optimization problems are likely to be constrained. There are equality constraints and inequality constraints. Although the former in simpler situations could be substituted into the objective function through reparametrization, the later is more difficult to handle which may include a set of boundary conditions on the estimated parameters.

Equality Constraint

The equality constrained optimization problem is represented as follows:

maximize (or minimize) f(x)
subject to c(x) = 0

In terms of the Lagrangian Function defined by

L(x,l) = f(x) - l' c(x),

it becomes an unconstrained optimization problem for x and l, where l is a vector of Lagrangian Multipliers. For a binding or active constraint, the corresponding element in l must be non-zero.

If x0 is a maximum (or minimum), from the first-order condition

f(x0)/x - l' c(x0)/x = 0
c(x0) = 0

and the second-order (bordered Hessian) condition is stated as:

2f(x0)/x'x is negative (or positive) definite on the subspace tangent to c(x) at x0. That is, the matrix

é 2f(x0)/x'x    c(x) ù
ë c(x)'    0 û
is negative (positive) definite.

Inequality Constraint

A typical inequality constrained optimization problem is given as:

maximize f(x)minimize f(x)
subject to c(x) £ 0 subject to c(x) ³ 0

If x0 is a maximum (or minimum) and l is the vector of Lagrangian multipliers, from the first-order condition of maximizing (or minimizing) the Lagrangian function L(x,l) = f(x) - l' c(x), known as the Kuhn-Tucker Conditions are:

f(x0)/x - l' c(x0)/x = 0
c(x0) £ 0 for maximization (or c(x0) ³ 0 for minimization), and l ³ 0.

The last set of inequality conditions imply that lici(x0) = 0 for each constraint i. In other words, for an active constraint i, that is ci(x) = 0, it must have the corresponding Lagrangian multiplier li > 0.

A general optimization problem may consist of both equality and inequality constraints. The same representation can be used with the vector of inequality constraints to include elements of equality. For numerical computation of the constrained solution, equality and inequality constraints and bounded set restrictions are treated separately for efficiency reason.

Special Cases

  1. Nonnegativity Constraints

    Consider a special problem of bounded minimization:

    minimize f(x) subject to x ³ 0.

    The solution is found from solving the following first-order condition:

    f(x0)/x ³ 0 and x0 ³ 0.

  2. Linear Programming

    A classical LP problem is formulated like this:

    minimize f(x) = p'x
    subject to Ax - b ³ 0, and x ³ 0.

    where the vectors p, b, and the matrix A are the parameters conformable with x.

  3. Quadratic Programming

    A typical QP problem is given as follows:

    minimize f(x) = p'x + ½ x'Qx
    subject to Ax - b ³ 0, and x ³ 0.

    where the vectors p, b, and the matrices Q, A, are the parameters conformable with x.

Using Unconstrained Optimization Methods

Using GAUSS, constrained optimization problems can be solved with the built-in function SQPSOLVE (Similarly, QNEWTON procedure is useful for solving unconstrained problems). However, only the BFGS descent algorithm is available in both GAUSS procedures.

By substituting out (linear or nonlinear) constraints, we can use unconstrained optimization methods to solve a constrained problem. For inequality constraints, parameter transformation may be used. For example, instead of estimating the constrained parameter x, we estimate the unconstrained parameter z and compute x from the continuous transformation x = f(z) as follows:

Parameter
Constraint
Prameter
Transformation
x = f(z), -¥ < z < ¥
x > 0x = z2 or
x = exp(z)
0 < x < 1x = exp(z)/(1+exp(z))
-1 < x < 1x = (exp(z)-1)/(1+exp(z)) = tanh(z/2)
0 < x1 + x2 < 1
x1 > 0, x2 > 0
x1 = exp(z1)/(1+exp(z1)+exp(z2))
x2 = exp(z2)/(1+exp(z1)+exp(z2))

Given the transformation function x = f(z), the objective function f(x) is transformed to F(z) = f(f(z)). Therefore, applying constrained optimization to f with respect to x is equivalent to applying unconstrained optimization to F with respect to z. However, the parameter of interest is x, not z. It is useful to verify the gradient and hessian of the function f with respect to x.

From the unconstrained solution z of F, the associated gradient and hessian are:

F/z = (f/x)(¶f/z) = 0
2F/zz' = (f/x) [2f/zz'] + (¶f/z)' [2f/xx'] (¶f/z)

Therefore,

f/x = (F/z) (¶f/z)-1 = 0, because F/z = 0
2f/xx' = (¶f/z)'-1 [2F/zz'] (¶f/z)-1

Note that the transformation function f: z -> x may be vector-valued, then the gradient ¶f/z is a matrix. The inverse of the gradient matrix is used to transform the hessians.

Example 5

Using GAUSS built-in procedure SQPSOLVE, find the minima of the 2-variable function:

f(x,y) = (x2+y-11)2 + (x+y2-7)2

for x ³ 0, y ³ 0, and constrained with the following nonlinear inequality:

(x-3)2 + (y-2)2 ³ 1

In other words, the nonnegative solution must be the minimum point on the surface above the circle in the x-y plane with unit radius and centered on x = 3, y = 2. (Program)

Example 6

Continue the previous Example 4 of estimating the CES production function:

ln(Q) = b1 + b4ln(b2Lb3 + (1-b2)Kb3) + e

There are several constraints of interest:

Formulate and find the least squares solution of the constrained CES production function, using GAUSS built-in procedure SQPSOLVE (Program).

Example 7: Mixture of Probability Distributions Revisited

The mixture probability l must satisfy the condition that 0 < l < 1. To impose the unit interval restriction on l, we could apply the transformation such as: where -¥ < q < ¥ is unrestricted. Since our interest is the probability parameter l, the estimates of parameter variances should be obtained for l.


Copyright© Kuan-Pin Lin
Last updated: March 28, 2006