Nonlinear Optimization

Introduction

Many economic and econometric problems can be formulated as optimization (minimization or maximization) problems, with or without constraints. In most cases, simple equality constraints can be substituted into the objective function, so the problem becomes essentially one of unconstrained optimization. Inequality constraints are possible but make the problem more difficult to solve. In Stata's Mata, optimization with linear constraints is implemented, but optimization with nonlinear constraints is not.

Scalar-Valued Function of One or Two Variables

Taylor Approximation: A Review

Given an n-variable scalar-valued function f: R^n -> R, the 1xn gradient vector of f (first derivatives) evaluated at x0 is g(x0), and the corresponding nxn Hessian matrix of f (second derivatives) is H(x0). The first-order (linear) Taylor approximation of f at x0 is

f(x) = f(x0) + g(x0) (x-x0)

Similarly, the second-order (quadratic) Taylor approximation of f at x0 is

f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0)
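As a quick numerical check of these approximations, here is a minimal sketch in Python with NumPy; the function f, the point x0, and the evaluation point x are arbitrary choices for illustration.

import numpy as np

# Compare f(x) with its first- and second-order Taylor approximations
# around x0, for the test function f(x1,x2) = exp(x1) + x2^2.
def f(x):
    return np.exp(x[0]) + x[1]**2

def gradient(x):                       # 1x2 gradient g(x)
    return np.array([np.exp(x[0]), 2.0*x[1]])

def hessian(x):                        # 2x2 Hessian H(x)
    return np.array([[np.exp(x[0]), 0.0],
                     [0.0,          2.0]])

x0 = np.array([0.0, 1.0])
x  = np.array([0.1, 1.2])
dx = x - x0

f1 = f(x0) + gradient(x0) @ dx            # first-order approximation
f2 = f1 + 0.5 * dx @ hessian(x0) @ dx     # second-order approximation
print(f(x), f1, f2)    # the second-order approximation should be closer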

Unconstrained Optimization

A typical unconstrained optimization problem is solved as follows:

max f(x): x0 is the maximum if
  f(x0) ≥ f(x) for all x,
  g(x0) = 0, and
  H(x0) is negative definite.

min f(x): x0 is the minimum if
  f(x0) ≤ f(x) for all x,
  g(x0) = 0, and
  H(x0) is positive definite.

From the first-order approximation of f at x0, we have

f(x) = f(x0) + g(x0) (x-x0), or

f(x0+Δx) - f(x0) = g(x0) Δx,

where Δx = x-x0 is called the step of optimization.

If x0 is not the optimum (maximum or minimum), then the function value may be improved by taking the step Δx in accordance with the direction of gradient g(x0).

max f(x): find Δx such that f(x0+Δx) - f(x0) > 0.
From the first-order approximation, g(x0) Δx > 0, i.e.,
  if g(x0) > 0, take Δx > 0;
  if g(x0) < 0, take Δx < 0.

min f(x): find Δx such that f(x0+Δx) - f(x0) < 0.
From the first-order approximation, g(x0) Δx < 0, i.e.,
  if g(x0) > 0, take Δx < 0;
  if g(x0) < 0, take Δx > 0.

Cauchy Step

Setting Δx = g(x0)' for maximization or Δx = -g(x0)' for minimization is the so-called Cauchy Step.

Newton Step

If f(x) is approximated at x0 to the second order, f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0). Suppose x is the optimum (maximum or minimum); then the first-order condition of optimization, ∂f(x)/∂x = 0, gives

g(x0) + Δx' H(x0) = 0, or Δx = -H(x0)^-1 g(x0)'.

Here the optimal step Δx = -H(x0)^-1 g(x0)' is the so-called Newton Step. Note that H(x0) is a negative definite (positive definite) matrix when x0 is near the maximum (minimum) x.
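A minimal Python sketch of the Newton iteration built on this step; the test function, starting value, and tolerance are arbitrary choices.

import numpy as np

# Newton-Raphson iteration x(i+1) = x(i) - H(x)^-1 g(x)' for minimizing
# the test function f(x) = (x1-1)^2 + 2(x2+0.5)^2, minimum at (1, -0.5).
def gradient(x):
    return np.array([2.0*(x[0] - 1.0), 4.0*(x[1] + 0.5)])

def hessian(x):
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

x = np.array([3.0, 2.0])                              # initial value x0
for _ in range(20):
    x = x - np.linalg.solve(hessian(x), gradient(x))  # take the Newton step
    if np.linalg.norm(gradient(x)) < 1e-8:            # first-order condition
        break
print(x)    # -> [1.0, -0.5]; for a quadratic f, one step suffices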

Numerical Computation Method

Step Size and Direction

Finding the solution x is essentially the task of searching for the optimal step Δx so that the first and second order conditions of function optimization are satisfied. Let Δx = s d: s is the step size (>0), and d is the step direction. Various methods of optimization are available depending on the choice of s and d. In general, the direction d is determined according to the gradient vector g':

max f(x): Δx = s d, where
  s > 0,
  d = M g', and
  M is positive definite.

min f(x): Δx = s d, where
  s > 0,
  d = -M g', and
  M is positive definite.

Cauchy type direction is obtained by setting M = I, the identity matrix. Newton type direction is obtained by setting M = -H^-1 for maximization (or M = H^-1 for minimization). Note that the Hessian H is negative definite for maximization and positive definite for minimization.

In searching for an optimum, the following typical iterative procedure is used:

for the (i+1)-th iteration, xi+1 = xi + si di, where di = ± Mi gi' with Mi being positive definite.

Given the initial value x0 and step direction d0, determine a step size s0. Then compute x1 = x0 + s0 d0. Here d0 is method dependent. The iteration looks like this:

specify         compute             check      (yes)
x0 ---> s0 ---> x1 = x0 + s0 d0 ---> convergence ---> stop
      |              |                             | (no)
      |              |                             | 
      d0           s1 <------------------- compute d1
How is a step size, s0 or s1 above, determined?
Given x0 and d0, find s > 0 so that f(x0 + s d0) is maximized (or minimized). This is a one-dimensional line search problem.
What is the optimal size of a step (s>0)?
From the second-order approximation:

f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0) or

f(x0+s d) = f(x0) + g(x0) s d + ½ s^2 d' H(x0) d.

Maximizing (or minimizing) this function with respect to s, the first-order condition is

g(x0) d + s d' H(x0) d = 0.

Then the optimal step size is s = -(g(x0) d)/(d' H(x0) d). For Cauchy type direction, the optimal step size is

s = -(g g')/(g H g').

For Newton type direction, the optimal step size is exactly 1.

In practice, the optimal step size is rarely computed exactly, since it requires the Hessian. Here a simple bi-directional line search method is considered instead:

  1. Given the step direction d and a specified step size s, let x = x0 + s d. If the function value f(x) improves over the previous f(x0), the step size in use may be enlarged:

    Repeat s = 1.1 s until no further improvement on f(x).

  2. On the other hand, if f(x) is worse than f(x0), the step size in use should be contracted:

    Repeat s = s/2 until an improvement of f(x) is found. If f(x) cannot be improved during this contraction process, x0 is already an optimum or a saddle point.

Steps 1 and 2 may be repeated and iterated, as in the sketch below. Alternative methods are available for searching for a proper step size; for example, see Berndt, Hall, Hall, and Hausman [1974].
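A minimal Python sketch of this expand/contract rule for minimization; the scale factors 1.1 and 1/2 follow the text, while the iteration cap max_iter is an added safeguard.

import numpy as np

# Bi-directional line search: given point x0 and direction d, enlarge the
# step size while f keeps improving, otherwise contract it.
def line_search(f, x0, d, s=1.0, max_iter=100):
    f0 = f(x0)
    if f(x0 + s*d) < f0:                       # step improves: try enlarging
        while f(x0 + 1.1*s*d) < f(x0 + s*d):
            s = 1.1*s
    else:                                      # step worsens: contract
        for _ in range(max_iter):
            s = s/2.0
            if f(x0 + s*d) < f0:
                break
        else:
            return 0.0   # no improvement: x0 may be an optimum or saddle point
    return s

# Example: f(x) = x'x from x0 = (1,1) along d = -(2,2), the negative gradient;
# the search returns 0.5, which here is also the exact optimal step size.
f = lambda x: x @ x
print(line_search(f, np.array([1.0, 1.0]), np.array([-2.0, -2.0])))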

Optimization Methods

Since di = ± Mi gi', the choice of the positive definite matrix Mi defines the different optimization methods:

Optimization Method                 Matrix M
Steepest ascent (descent) method    I
BFGS method                         see (2) below
DFP method                          see (2) below
Newton-Raphson method               -H^-1
Greenstadt method                   (∑ |wi| vi vi')^-1 or ∑ (1/|wi|) vi vi', see (1) below
Quadratic-hill climbing             -(H - r I)^-1, see (1) below
Modified quadratic-hill climbing    -(H - r A)^-1, see (1) below

Explanation Notes:
  1. d = ± M g' requires M to be positive definite, but for some iteration i, Mi may not be positive definite. Various methods have been designed to deal with the non-positive definiteness of Mi. These include, but are not limited to, the following:

    For the Greenstadt method, let wi and vi be the i-th eigenvalue and the corresponding eigenvector of the Hessian H, so that H = ∑ wi vi vi'. With all wi > 0, the inverse ∑ (1/wi) vi vi' is also positive definite. If some wi are negative, use |wi| in place of wi.

    For the quadratic-hill climbing methods, r ≥ 0 forces H - rI or H - rA to be negative definite (for maximization). Indeed, r = 0 if H is already negative definite, and r is greater than the largest eigenvalue of H otherwise. For the modified quadratic-hill climbing method, A is the transformation matrix for an optimal search over an ellipsoidal region. Computing the Hessian and its eigenvalues and eigenvectors may be expensive.

  2. Since the Newton method and its variations require computing the Hessian matrix H at each iteration, a class of quasi-Newton methods using only the gradient vector is available:

    The idea is to approximate H^-1 during the iterations: let N be the matrix approximating H^-1, Δxi = xi+1 - xi, and Δgi = gi+1 - gi. Then the Broyden (rank one correction) method computes Ni+1 from Ni according to

    Ni+1 = Ni + (Δxi - Ni Δgi')(Δxi - Ni Δgi')' / ((Δxi - Ni Δgi')' Δgi')

    The formula of Ni+1 for the DFP (rank two correction) method is

    Ni+1 = Ni + (Δxi Δxi') / (Δxi' Δgi') - (Ni Δgi' Δgi Ni) / (Δgi Ni Δgi')

    The formula of Ni+1 for the BFGS (rank two correction) method is

    Ni+1 = Ni + [1 + (Δgi Ni Δgi') / (Δxi' Δgi')] (Δxi Δxi') / (Δxi' Δgi') - (Δxi Δgi Ni + Ni Δgi' Δxi') / (Δxi' Δgi')
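For concreteness, these updates transcribe directly into NumPy. A minimal sketch, assuming dx and dg are one-dimensional arrays holding Δxi and Δgi' (so inner products are plain dot products), and N is the current symmetric approximation of H^-1:

import numpy as np

# DFP (rank two correction) update of the inverse-Hessian approximation N
def dfp_update(N, dx, dg):
    xg = dx @ dg                    # Δxi'Δgi'
    Ng = N @ dg                     # Ni Δgi'
    return N + np.outer(dx, dx)/xg - np.outer(Ng, Ng)/(dg @ Ng)

# BFGS (rank two correction) update of the inverse-Hessian approximation N
def bfgs_update(N, dx, dg):
    xg = dx @ dg
    Ng = N @ dg
    return (N + (1 + (dg @ Ng)/xg)*np.outer(dx, dx)/xg
              - (np.outer(dx, Ng) + np.outer(Ng, dx))/xg)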

Convergence Criteria

Since the numeric optimization methods are iterative in nature, certain convergence criteria are needed to stop the iterations when a solution is found. Given a tolerance level, for two iterations i and i+1,
  1. f(xi+1) ≥ f(xi) for maximization, or
    f(xi+1) ≤ f(xi) for minimization;
    |f(xi+1) - f(xi)| --> 0.
  2. ||xi+1 - xi|| --> 0 and (1).
  3. ||g(xi)||, ||g(xi+1)|| --> 0 and (1), (2).
    Note: ||g(xi)|| may be defined as g(xi) H(xi)^-1 g(xi)', if H(xi) is evaluated.
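These criteria combine naturally into a single stopping rule. A minimal sketch for minimization, where tol is the tolerance level; for simplicity the gradient test uses the plain Euclidean norm rather than the Hessian-weighted norm in the note:

import numpy as np

# Stop when the function value, the parameters, and the gradient all settle.
def converged(f0, f1, x0, x1, g1, tol=1e-8):
    return (f1 <= f0                             # (1) no deterioration
            and abs(f1 - f0) < tol               # (1) function change vanishes
            and np.linalg.norm(x1 - x0) < tol    # (2) parameter change vanishes
            and np.linalg.norm(g1) < tol)        # (3) gradient vanishes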

Example 1: Solving a Mathematical Function

Example 2: Estimating Probability Distributions

The characteristics of a random variable (e.g., mean and variance) may be evaluated through the joint probability density of a finite sample. This joint density function, or the likelihood function, is defined as the product of N independent density functions f(Xi,θ) of the data observations Xi (i=1,2,...,N), given a parameter vector θ. That is,

∏i=1,...,N f(Xi,θ),

or equivalently in log form:

ll(θ) = ∑i=1,2,...,N ln f(Xi,θ)

The problem is to maximize the log-likelihood function ll(θ) so that the solution θ characterizes the probability distribution of the random variable X under consideration. Finding the θ that maximizes ll(θ) is the essence of maximum likelihood estimation. The corresponding variance-covariance matrix of θ is derived from the information matrix (the negative of the expected value of the second derivatives) of the log-likelihood function as follows:

Var(θ) = [-E(∂^2ll/∂θ∂θ')]^-1

Probability Distributions in the Exponential Family

Probability distributions in the exponential family are popular in econometrics. For continuous data, the familiar example is the likelihood function derived from a normal or log-normal probability distribution. For binary data analysis, the Bernoulli or binomial distribution is used. To analyze accident (count) data, we typically assume a Poisson distribution. For survival data analysis (e.g., unemployment or strike duration), an inverse transformation of the random variable, and therefore an inverse probability distribution, is applied.

Normal Distribution

We begin with the familiar normal probability distribution:

f(X,θ) = 1/√(2πσ^2) exp[-(X-μ)^2/(2σ^2)]

where θ = (μ,σ^2) represents the distribution parameters. It is straightforward to show that the maximum likelihood solution is μ = E(X) = 1/N ∑i=1,...,N Xi, the sample mean, and σ^2 = Var(X) = 1/N ∑i=1,...,N (Xi-μ)^2, the sample variance. The corresponding log-likelihood function is:

ll(θ) = ∑i=1,...,N ln f(Xi,θ) = -(N/2) ln(2πσ^2) - ½ ∑i=1,...,N (Xi-μ)^2/σ^2
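A minimal sketch of this estimation in Python, minimizing the negative log-likelihood with scipy.optimize.minimize; the data here are simulated, standing in for an actual sample such as YED20.TXT:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=20)    # simulated sample

# Negative of the normal log-likelihood ll(theta) above
def neg_ll(theta):
    mu, sigma2 = theta
    if sigma2 <= 0:
        return np.inf
    return 0.5*len(X)*np.log(2*np.pi*sigma2) + 0.5*np.sum((X - mu)**2)/sigma2

res = minimize(neg_ll, x0=[X.mean(), X.var()], method='BFGS')
print(res.x)         # MLE of (mu, sigma^2): the sample mean and variance
print(res.hess_inv)  # approximates Var(theta) = [-E(d2ll/dθdθ')]^-1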

Log-Normal Distribution

Another familiar example is based on the log-normal distribution of X (or equivalently, the normal distribution of ln(X)), defined as:

f(X,θ) = 1/√(2πσ^2) (1/X) exp[-(ln(X)-μ)^2/(2σ^2)]

With the solution μ = 1/N ∑i=1,...,N ln(Xi) and σ^2 = 1/N ∑i=1,...,N (ln(Xi)-μ)^2, the corresponding mean and variance of X are E(X) = exp(μ+σ^2/2) and Var(X) = exp(2μ+σ^2)[exp(σ^2)-1], respectively. Many economic variables are described by a log-normal rather than a normal probability distribution.

Gamma Distribution

Of course, maximum likelihood estimation is not limited to models with a normal or log-normal probability distribution. In many situations, the probability distribution of a random variable may be non-normal or unknown. For example, for a nonnegative random variable X ≥ 0 with a gamma distribution, the density function is

f(X,θ) = λ^ρ/Γ(ρ) e^(-λX) X^(ρ-1)

where θ = (λ,ρ) is the parameter vector with λ > 0 and ρ > 0. The mean of X is ρ/λ, and the variance is ρ/λ^2. Many familiar distributions, such as the exponential (ρ=1) and chi-square (ρ=N/2, λ=1/2) distributions, are special cases of the gamma distribution.

As with the normal distribution, the technique of maximum likelihood can be used to estimate the parameters of the gamma distribution. For a sample of N independent observations from the gamma distribution, the log-likelihood function is:

ll(θ) = N [ρ ln(λ) - ln Γ(ρ)] - λ ∑i=1,...,N Xi + (ρ-1) ∑i=1,...,N ln(Xi).
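A sketch of maximizing this log-likelihood in Python, with scipy.special.gammaln supplying ln Γ(ρ); the simulated sample and starting values are arbitrary choices:

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

X = np.random.default_rng(1).gamma(shape=2.0, scale=1.5, size=100)  # simulated

# Negative gamma log-likelihood in theta = (lambda, rho)
def neg_ll(theta):
    lam, rho = theta
    if lam <= 0 or rho <= 0:
        return np.inf
    N = len(X)
    return -(N*(rho*np.log(lam) - gammaln(rho))
             - lam*X.sum() + (rho - 1)*np.log(X).sum())

res = minimize(neg_ll, x0=[1.0, 1.0], method='Nelder-Mead')
print(res.x)    # (lambda, rho); then E(X) = rho/lambda, Var(X) = rho/lambda^2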

Inverse Gamma Distribution

The inverse gamma distribution is the distribution of Y = 1/X, where X has the gamma distribution. It is defined by:

f(Y,θ) = λ^ρ/Γ(ρ) e^(-λ/Y) Y^(-(ρ+1))

where θ = (λ,ρ) is the parameter vector with ρ > 0 and λ > 0.

Inverse Normal Distribution (Wald Distribution)

The standard form of inverse normal distribution for the random variable X is:

f(X,θ) = √(λ/(2πX^3)) exp[-λ(X-μ)^2/(2μ^2 X)]

where θ = (μ,λ) is the vector of distribution parameters for X > 0, λ > 0, and μ > 0. The maximum likelihood estimators are μ = 1/N ∑i=1,...,N Xi and 1/λ = 1/N ∑i=1,...,N (1/Xi - 1/μ). The mean and variance of X are E(X) = μ and Var(X) = μ^3/λ, respectively.

With the normal, log-normal, and gamma (or exponential) probability distributions, the characteristics of the random variable X may be described in terms of the estimated mean and variance for each probability distribution as follows:

                   Normal   Log-Normal                Gamma    Exponential
Mean E(X)          μ        exp(μ+σ^2/2)              ρ/λ      1/λ
Variance Var(X)    σ^2      exp(2μ+σ^2)[exp(σ^2)-1]   ρ/λ^2    1/λ^2

where μ = 1/N ∑i=1,...,N Xi and σ^2 = 1/N ∑i=1,...,N (Xi-μ)^2 for the normal distribution; μ = 1/N ∑i=1,...,N ln(Xi) and σ^2 = 1/N ∑i=1,...,N (ln(Xi)-μ)^2 for the log-normal distribution.

Based on 20 observations of the hypothetical data series INCOME of Greene [Table C.1] or YED20.TXT, estimate its mean and variance under each of three probability distributions: normal, log-normal, and gamma. First, define the log-likelihood function for each probability distribution. Then estimate the parameters of each distribution by maximizing the corresponding log-likelihood function.

Example 3: Mixture of Probability Functions

It is possible that a random variable is drawn from a mixture of probability distributions (two or more, of the same or different types). As a simple case, consider X distributed as a mixture of two normal distributions:

f1(X,μ1,σ1) = 1/√(2πσ1^2) exp[-(X-μ1)^2/(2σ1^2)] and
f2(X,μ2,σ2) = 1/√(2πσ2^2) exp[-(X-μ2)^2/(2σ2^2)]

Then the likelihood function is

f(X,θ) = λ f1(X,μ1,σ1) + (1-λ) f2(X,μ2,σ2)

where λ is the probability that an observation is drawn from the first distribution f1(X,μ1,σ1), and 1-λ is the probability that it is drawn from the second. θ = (μ1,σ1,μ2,σ2,λ)' is the unknown parameter vector to be estimated.

Continuing from the previous example, suppose each observation of the variable INCOME is drawn from one of two different normal distributions. Formulate and maximize the log-likelihood function to solve for the parameters of the two normal distributions.
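A minimal sketch of this estimation in Python; simulated data stand in for INCOME, and the starting values are arbitrary (mixture likelihoods are sensitive to them):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
INCOME = np.concatenate([rng.normal(0.0, 1.0, 60),    # draws from f1
                         rng.normal(5.0, 2.0, 40)])   # draws from f2

# Negative log-likelihood of the two-normal mixture,
# theta = (mu1, sigma1, mu2, sigma2, lambda)
def neg_ll(theta):
    mu1, s1, mu2, s2, lam = theta
    if s1 <= 0 or s2 <= 0 or not 0 < lam < 1:
        return np.inf
    mix = lam*norm.pdf(INCOME, mu1, s1) + (1 - lam)*norm.pdf(INCOME, mu2, s2)
    return -np.sum(np.log(mix))

res = minimize(neg_ll, x0=[0.0, 1.0, 4.0, 1.0, 0.5], method='Nelder-Mead')
print(res.x)    # estimates of (mu1, sigma1, mu2, sigma2, lambda)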

Example 4: Estimating Regression Equation

Using the above hypothetical data series INCOME and EDUCATION, we estimate the effect of EDUCATION on INCOME with a linear parametrization of the mean, μ = E(INCOME) = α + β EDUCATION. The linear regression equation is then INCOME = α + β EDUCATION + ε, where the error term ε is assumed to follow a normal probability distribution with zero mean and constant variance σ^2.

The method of maximum likelihood estimation is to find the parameter vector θ = (α,β,σ^2) so that the following log-likelihood function

ll(θ) = ∑i=1,...,N ln f(εi,θ) = -(N/2) ln(2πσ^2) - ½ ∑i=1,...,N εi^2/σ^2
= -(N/2) ln(2πσ^2) - ½ ∑i=1,...,N (INCOMEi - α - β EDUCATIONi)^2/σ^2

is maximized.

Without assuming the normal probability distribution for the regression error, the alternative method of least squares estimation is to find the parameter vector (α,β) by minimizing the sum of squared errors

S(α,β) = ε'ε = ∑i=1,...,N εi^2 = ∑i=1,...,N (INCOMEi - α - β EDUCATIONi)^2
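A sketch of both estimators in Python; the INCOME and EDUCATION series here are simulated, not Greene's data. Under normality, the maximum likelihood estimates of α and β coincide with the least squares estimates.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
EDUCATION = rng.integers(8, 21, size=20).astype(float)          # simulated
INCOME = 5.0 + 2.0*EDUCATION + rng.normal(0.0, 3.0, size=20)    # simulated

# Negative log-likelihood in theta = (alpha, beta, sigma^2)
def neg_ll(theta):
    a, b, sigma2 = theta
    if sigma2 <= 0:
        return np.inf
    e = INCOME - a - b*EDUCATION
    return 0.5*len(e)*np.log(2*np.pi*sigma2) + 0.5*np.sum(e**2)/sigma2

res = minimize(neg_ll, x0=[0.0, 1.0, 1.0], method='Nelder-Mead')
print(res.x)                               # ML estimates (alpha, beta, sigma^2)
print(np.polyfit(EDUCATION, INCOME, 1))    # least squares returns (beta, alpha)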

Constrained Optimization

In many economic and econometric applications, the optimization problems are likely to be constrained, with equality constraints, inequality constraints, or both. Although equality constraints in simpler situations can be substituted into the objective function through reparametrization, inequality constraints, which may include a set of boundary conditions on the estimated parameters, are more difficult to handle.

Equality Constraint

The equality constrained optimization problem is represented as follows:

maximize (or minimize) f(x)
subject to c(x) = 0

In terms of the Lagrangian Function defined by

L(x,λ) = f(x) - λ' c(x),

it becomes an unconstrained optimization problem in x and λ, where λ is a vector of Lagrangian multipliers. For a binding or active constraint, the corresponding element of λ must be non-zero.

If x0 is a maximum (or minimum), the first-order conditions are

∂f(x0)/∂x - λ' ∂c(x0)/∂x = 0
c(x0) = 0

and the second-order (bordered Hessian) condition is stated as:

∂^2L(x0,λ)/∂x∂x' is negative (or positive) definite on the subspace tangent to c(x) = 0 at x0. This condition is checked through the bordered Hessian matrix

[ ∂^2L(x0,λ)/∂x∂x'   ∂c(x0)/∂x' ]
[ ∂c(x0)/∂x          0          ]

Inequality Constraint

A typical inequality constrained optimization problem is given as:

maximize f(x) subject to c(x) ≤ 0, or
minimize f(x) subject to c(x) ≥ 0.

If x0 is a maximum (or minimum) and λ is the vector of Lagrangian multipliers, the first-order conditions for maximizing (or minimizing) the Lagrangian function L(x,λ) = f(x) - λ' c(x), known as the Kuhn-Tucker conditions, are:

∂f(x0)/∂x - λ' ∂c(x0)/∂x = 0
c(x0) ≤ 0 for maximization (or c(x0) ≥ 0 for minimization), and λ ≥ 0.

The last set of inequality conditions implies that λi ci(x0) = 0 for each constraint i. In other words, an active constraint i, with ci(x0) = 0, has a corresponding Lagrangian multiplier λi > 0.

A general optimization problem may consist of both equality and inequality constraints. The same representation can be used if the vector of inequality constraints is extended to include the equalities. For numerical computation of the constrained solution, equality constraints, inequality constraints, and bounded-set restrictions are treated separately for efficiency reasons.

Special Cases

  1. Nonnegativity Constraints

    Consider a special problem of bounded minimization:

    minimize f(x) subject to x ≥ 0.

    The solution is found from the following first-order (Kuhn-Tucker) conditions:

    ∂f(x0)/∂x ≥ 0, x0 ≥ 0, and x0i ∂f(x0)/∂xi = 0 for each element i.

  2. Linear Programming

    A classical LP problem is formulated like this:

    minimize f(x) = p'x
    subject to Ax - b ≥ 0, and x ≥ 0.

    where the vectors p, b, and the matrix A are the parameters conformable with x.

  3. Quadratic Programming

    A typical QP problem is given as follows:

    minimize f(x) = p'x + ½ x'Qx
    subject to Ax - b ≥ 0, and x ≥ 0.

    where the vectors p, b, and the matrices Q, A, are the parameters conformable with x. A numerical sketch follows below.
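Such problems can also be solved numerically with an off-the-shelf routine. A sketch using scipy.optimize.minimize with the SLSQP method; the particular p, Q, A, and b are arbitrary illustrations:

import numpy as np
from scipy.optimize import minimize

# min p'x + 1/2 x'Qx  subject to  Ax - b >= 0 and x >= 0
p = np.array([-2.0, -5.0])
Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])
A = np.array([[-1.0, -2.0]])    # Ax - b >= 0 encodes x1 + 2 x2 <= 6
b = np.array([-6.0])

res = minimize(lambda x: p @ x + 0.5*(x @ Q @ x),
               x0=np.zeros(2), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda x: A @ x - b}],
               bounds=[(0, None), (0, None)])
print(res.x)    # -> approximately (1.0, 2.5); the constraint is just binding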

Using Unconstrained Optimization Methods

Using GAUSS, constrained optimization problems can be solved with the built-in procedure SQPSOLVE (similarly, the QNEWTON procedure is useful for solving unconstrained problems). However, only the BFGS descent algorithm is available in both GAUSS procedures.

By substituting out (linear or nonlinear) constraints, we can use unconstrained optimization methods to solve a constrained problem. For inequality constraints, a parameter transformation may be used. For example, instead of estimating the constrained parameter x, we estimate the unconstrained parameter z and compute x from a continuous transformation x = φ(z), -∞ < z < ∞, as follows:

Parameter Constraint    Parameter Transformation x = φ(z)
x > 0                   x = z^2 or x = exp(z)
0 < x < 1               x = exp(z)/(1+exp(z))
-1 < x < 1              x = (exp(z)-1)/(1+exp(z)) = tanh(z/2)
x1 > 0, x2 > 0,         x1 = exp(z1)/(1+exp(z1)+exp(z2)) and
0 < x1 + x2 < 1         x2 = exp(z2)/(1+exp(z1)+exp(z2))

Given the transformation function x = φ(z), the objective function f(x) is transformed into F(z) = f(φ(z)). Therefore, constrained optimization of f with respect to x is equivalent to unconstrained optimization of F with respect to z. However, the parameter of interest is x, not z, so it is useful to recover the gradient and Hessian of f with respect to x.

From the unconstrained solution z of F, the associated gradient and Hessian are:

∂F/∂z = (∂f/∂x)(∂φ/∂z) = 0
∂^2F/∂z∂z' = (∂f/∂x)[∂^2φ/∂z∂z'] + (∂φ/∂z)' [∂^2f/∂x∂x'] (∂φ/∂z)

Therefore,

∂f/∂x = (∂F/∂z)(∂φ/∂z)^-1 = 0, because ∂F/∂z = 0
∂^2f/∂x∂x' = (∂φ/∂z)'^-1 [∂^2F/∂z∂z'] (∂φ/∂z)^-1

Note that the transformation function φ: z -> x may be vector-valued, in which case the gradient ∂φ/∂z is a matrix whose inverse is used to transform the Hessians.
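A scalar sketch of this recovery in Python for the case 0 < x < 1; z_hat and var_z are placeholders for the output of an unconstrained optimizer, and the variance transformation is the scalar case of the Hessian formula above:

import numpy as np

def phi(z):                       # logistic transform: maps R into (0,1)
    return np.exp(z)/(1.0 + np.exp(z))

def dphi(z):                      # its derivative dphi/dz
    return phi(z)*(1.0 - phi(z))

z_hat, var_z = 0.8, 0.25          # placeholder unconstrained estimates
x_hat = phi(z_hat)                # the parameter of interest
var_x = dphi(z_hat)**2 * var_z    # Var(x) = (dphi/dz) Var(z) (dphi/dz)'
print(x_hat, var_x)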

Example 3 Revisited: Mixture of Probability Distributions

The mixture probability λ must satisfy the condition 0 < λ < 1. To impose the unit-interval restriction on λ, we could apply a transformation such as λ = exp(θ)/(1+exp(θ)), where -∞ < θ < ∞ is unrestricted. Since our interest is the probability parameter λ, the estimates of the parameter variances should be obtained for λ, not θ.


Copyright © Kuan-Pin Lin
Last updated: 09/28/2012