Nonlinear Optimization

Introduction

Many economic and econometric problems can be formulated as optimization (minimization or maximization) problems, with or without constraints. In most cases, simple equality constraints can be substituted into the objective function, so the problem becomes essentially one of unconstrained optimization. Inequality constraints are possible but make the problem more difficult to solve. In Stata's Mata, optimization with linear constraints is implemented, but optimization with nonlinear constraints is not.

Scalar-Valued Function of One or Two Variables

Taylor Approximation: A Review

Given an n-variable scalar-valued function f: R^n -> R, the 1xn gradient vector of f (first derivatives) evaluated at x0 is g(x0), and the corresponding nxn Hessian matrix of f (second derivatives) is H(x0). The first-order (linear) Taylor approximation of f at x0 is

f(x) = f(x0) + g(x0) (x-x0)

Similarly, the second-order (quadratic) Taylor approximation of f at x0 is

f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0)
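As a quick numerical check of these approximations, here is a minimal sketch in Python with NumPy; the function f, the point x0, and the evaluation point x are arbitrary choices for illustration.

import numpy as np

# Compare f(x) with its first- and second-order Taylor approximations
# around x0, for the test function f(x1,x2) = exp(x1) + x2^2.
def f(x):
    return np.exp(x[0]) + x[1]**2

def gradient(x):                       # 1x2 gradient g(x)
    return np.array([np.exp(x[0]), 2.0*x[1]])

def hessian(x):                        # 2x2 Hessian H(x)
    return np.array([[np.exp(x[0]), 0.0],
                     [0.0,          2.0]])

x0 = np.array([0.0, 1.0])
x  = np.array([0.1, 1.2])
dx = x - x0

f1 = f(x0) + gradient(x0) @ dx            # first-order approximation
f2 = f1 + 0.5 * dx @ hessian(x0) @ dx     # second-order approximation
print(f(x), f1, f2)    # the second-order approximation should be closer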

Unconstrained Optimization

A typical unconstrained optimization problem is solved as follows:

max f(x): x0 is the maximum if
  f(x0) ≥ f(x) for all x,
  g(x0) = 0, and
  H(x0) is negative definite.

min f(x): x0 is the minimum if
  f(x0) ≤ f(x) for all x,
  g(x0) = 0, and
  H(x0) is positive definite.

From the first-order approximation of f at x0, we have

f(x) = f(x0) + g(x0) (x-x0), or

f(x0+Δx) - f(x0) = g(x0) Δx,

where Δx = x-x0 is called the step of optimization.

If x0 is not the optimum (maximum or minimum), then the function value may be improved by taking the step Δx in accordance with the direction of gradient g(x0).

max f(x): find Δx such that f(x0+Δx) - f(x0) > 0.
From the first-order approximation, g(x0) Δx > 0, i.e.,
  if g(x0) > 0, take Δx > 0;
  if g(x0) < 0, take Δx < 0.

min f(x): find Δx such that f(x0+Δx) - f(x0) < 0.
From the first-order approximation, g(x0) Δx < 0, i.e.,
  if g(x0) > 0, take Δx < 0;
  if g(x0) < 0, take Δx > 0.

Cauchy Step

Setting Δx = g(x0)' for maximization or Δx = -g(x0)' for minimization is the so-called Cauchy Step.

Newton Step

If f(x) is approximated at x0 to the second order, f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0). Suppose x is the optimum (maximum or minimum); then the first-order condition of optimization, ∂f(x)/∂x = 0, gives

g(x0) + Δx' H(x0) = 0, or Δx = -H(x0)^-1 g(x0)'.

Here the optimal step Δx = -H(x0)^-1 g(x0)' is the so-called Newton Step. Note that H(x0) is a negative definite (positive definite) matrix when x0 is near the maximum (minimum) x.
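A minimal Python sketch of the Newton iteration built on this step; the test function, starting value, and tolerance are arbitrary choices.

import numpy as np

# Newton-Raphson iteration x(i+1) = x(i) - H(x)^-1 g(x)' for minimizing
# the test function f(x) = (x1-1)^2 + 2(x2+0.5)^2, minimum at (1, -0.5).
def gradient(x):
    return np.array([2.0*(x[0] - 1.0), 4.0*(x[1] + 0.5)])

def hessian(x):
    return np.array([[2.0, 0.0],
                     [0.0, 4.0]])

x = np.array([3.0, 2.0])                              # initial value x0
for _ in range(20):
    x = x - np.linalg.solve(hessian(x), gradient(x))  # take the Newton step
    if np.linalg.norm(gradient(x)) < 1e-8:            # first-order condition
        break
print(x)    # -> [1.0, -0.5]; for a quadratic f, one step suffices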

Numerical Computation Method

Step Size and Direction

Finding the solution x is essentially the task of searching for the optimal step Δx so that the first and second order conditions of function optimization are satisfied. Let Δx = s d: s is the step size (>0), and d is the step direction. Various methods of optimization are available depending on the choice of s and d. In general, the direction d is determined according to the gradient vector g':

max f(x): Δx = s d, where
  s > 0,
  d = M g', and
  M is positive definite.

min f(x): Δx = s d, where
  s > 0,
  d = -M g', and
  M is positive definite.

Cauchy type direction is obtained by setting M = I, the identity matrix. Newton type direction is obtained by setting M = -H^-1 for maximization (or M = H^-1 for minimization). Note that the Hessian H is negative definite for maximization and positive definite for minimization.

In searching for an optimum, the following typical iterative procedure is used:

for the (i+1)-th iteration, xi+1 = xi + si di, where di = ± Mi gi' with Mi being positive definite.

Given the initial value x0 and step direction d0, determine a step size s0. Then compute x1 = x0 + s0 d0. Here d0 is method dependent. The iteration looks like this:

specify         compute             check      (yes)
x0 ---> s0 ---> x1 = x0 + s0 d0 ---> convergence ---> stop
      |              |                             | (no)
      |              |                             | 
      d0           s1 <------------------- compute d1
How is a step size, s0 or s1 above, determined?
Given x0 and d0, find s > 0 so that f(x0 + s d0) is maximized (or minimized). This is a one-dimensional line search problem.
What is the optimal size of a step (s>0)?
From the second-order approximation:

f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0) or

f(x0+s d) = f(x0) + g(x0) s d + ½ s^2 d' H(x0) d.

Maximizing (or minimizing) this function with respect to s, the first-order condition is

g(x0) d + s d' H(x0) d = 0.

Then the optimal step size is s = -(g(x0) d)/(d' H(x0) d). For Cauchy type direction, the optimal step size is

s = -(g g')/(g H g').

For Newton type direction, the optimal step size is exactly 1.

In practice, the optimal step size is rarely computed exactly, since it requires the Hessian. Here a simple bi-directional line search method is considered instead:

  1. Given the step direction d and a specified step size s, let x = x0 + s d. If the function value f(x) improves over the previous f(x0), the step size in use may be enlarged:

    Repeat s = 1.1 s until no further improvement on f(x).

  2. On the other hand, if f(x) is worse than f(x0), the step size in use should be contracted:

    Repeat s = s/2 until an improvement of f(x) is found. If f(x) cannot be improved during this contraction process, x0 is already an optimum or a saddle point.

Steps 1 and 2 may be repeated and iterated, as in the sketch below. Alternative methods are available for searching for a proper step size; for example, see Berndt, Hall, Hall, and Hausman [1974].
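A minimal Python sketch of this expand/contract rule for minimization; the scale factors 1.1 and 1/2 follow the text, while the iteration cap max_iter is an added safeguard.

import numpy as np

# Bi-directional line search: given point x0 and direction d, enlarge the
# step size while f keeps improving, otherwise contract it.
def line_search(f, x0, d, s=1.0, max_iter=100):
    f0 = f(x0)
    if f(x0 + s*d) < f0:                       # step improves: try enlarging
        while f(x0 + 1.1*s*d) < f(x0 + s*d):
            s = 1.1*s
    else:                                      # step worsens: contract
        for _ in range(max_iter):
            s = s/2.0
            if f(x0 + s*d) < f0:
                break
        else:
            return 0.0   # no improvement: x0 may be an optimum or saddle point
    return s

# Example: f(x) = x'x from x0 = (1,1) along d = -(2,2), the negative gradient;
# the search returns 0.5, which here is also the exact optimal step size.
f = lambda x: x @ x
print(line_search(f, np.array([1.0, 1.0]), np.array([-2.0, -2.0])))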

Optimization Methods

Since di = ± Mi gi', the choice of the positive definite matrix Mi defines the different optimization methods:

Optimization Method                 Matrix M
Steepest ascent (descent) method    I
BFGS method                         see (2) below
DFP method                          see (2) below
Newton-Raphson method               -H^-1
Greenstadt method                   (∑ |wi| vi vi')^-1 or ∑ (1/|wi|) vi vi', see (1) below
Quadratic-hill climbing             -(H - r I)^-1, see (1) below
Modified quadratic-hill climbing    -(H - r A)^-1, see (1) below

Explanation Notes:
  1. d = ± M g' requires M to be positive definite, but for some iteration i, Mi may not be positive definite. Various methods have been designed to deal with the non-positive definiteness of Mi. These include, but are not limited to, the following:

    For the Greenstadt method, let wi and vi be the i-th eigenvalue and the corresponding eigenvector of the Hessian H, so that H = ∑ wi vi vi'. With all wi > 0, the inverse ∑ (1/wi) vi vi' is also positive definite. If some wi are negative, use |wi| in place of wi.

    For the quadratic-hill climbing methods, r ≥ 0 forces H - rI or H - rA to be negative definite (for maximization). Indeed, r = 0 if H is already negative definite, and r is greater than the largest eigenvalue of H otherwise. For the modified quadratic-hill climbing method, A is the transformation matrix for an optimal search over an ellipsoidal region. Computing the Hessian and its eigenvalues and eigenvectors may be expensive.

  2. Since the Newton method and its variations require computing the Hessian matrix H at each iteration, a class of quasi-Newton methods using only the gradient vector is available:

    The idea is to approximate H^-1 during the iterations: let N be the matrix approximating H^-1, Δxi = xi+1 - xi, and Δgi = gi+1 - gi. Then the Broyden (rank one correction) method computes Ni+1 from Ni according to

    Ni+1 = Ni + (Δxi - Ni Δgi')(Δxi - Ni Δgi')' / ((Δxi - Ni Δgi')' Δgi')

    The formula of Ni+1 for the DFP (rank two correction) method is

    Ni+1 = Ni + (Δxi Δxi') / (Δxi' Δgi') - (Ni Δgi' Δgi Ni) / (Δgi Ni Δgi')

    The formula of Ni+1 for the BFGS (rank two correction) method is

    Ni+1 = Ni + [1 + (Δgi Ni Δgi') / (Δxi' Δgi')] (Δxi Δxi') / (Δxi' Δgi') - (Δxi Δgi Ni + Ni Δgi' Δxi') / (Δxi' Δgi')
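For concreteness, these updates transcribe directly into NumPy. A minimal sketch, assuming dx and dg are one-dimensional arrays holding Δxi and Δgi' (so inner products are plain dot products), and N is the current symmetric approximation of H^-1:

import numpy as np

# DFP (rank two correction) update of the inverse-Hessian approximation N
def dfp_update(N, dx, dg):
    xg = dx @ dg                    # Δxi'Δgi'
    Ng = N @ dg                     # Ni Δgi'
    return N + np.outer(dx, dx)/xg - np.outer(Ng, Ng)/(dg @ Ng)

# BFGS (rank two correction) update of the inverse-Hessian approximation N
def bfgs_update(N, dx, dg):
    xg = dx @ dg
    Ng = N @ dg
    return (N + (1 + (dg @ Ng)/xg)*np.outer(dx, dx)/xg
              - (np.outer(dx, Ng) + np.outer(Ng, dx))/xg)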

Convergence Criteria

Since the numeric optimization methods are iterative in nature, certain convergence criteria are needed to stop the iterations when a solution is found. Given a tolerance level, for two iterations i and i+1,
  1. f(xi+1) ≥ f(xi) for maximization, or
    f(xi+1) ≤ f(xi) for minimization;
    |f(xi+1) - f(xi)| --> 0.
  2. ||xi+1 - xi|| --> 0 and (1).
  3. ||g(xi)||, ||g(xi+1)|| --> 0 and (1), (2).
    Note: ||g(xi)|| may be defined as g(xi) H(xi)^-1 g(xi)', if H(xi) is evaluated.
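These criteria combine naturally into a single stopping rule. A minimal sketch for minimization, where tol is the tolerance level; for simplicity the gradient test uses the plain Euclidean norm rather than the Hessian-weighted norm in the note:

import numpy as np

# Stop when the function value, the parameters, and the gradient all settle.
def converged(f0, f1, x0, x1, g1, tol=1e-8):
    return (f1 <= f0                             # (1) no deterioration
            and abs(f1 - f0) < tol               # (1) function change vanishes
            and np.linalg.norm(x1 - x0) < tol    # (2) parameter change vanishes
            and np.linalg.norm(g1) < tol)        # (3) gradient vanishes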

Example 1: Solving a Mathematical Function

Example 2: Estimating Probability Distributions

The characteristics of a random variable (e.g., mean and variance) may be evaluated through the joint probability density of a finite sample. This joint density function, or the likelihood function, is defined as the product of N independent density functions f(Xi,θ) of the data observations Xi (i=1,2,...,N), given a parameter vector θ. That is,

∏i=1,...,N f(Xi,θ),

or equivalently in log form:

ll(θ) = ∑i=1,2,...,N ln f(Xi,θ)

The problem is to maximize the log-likelihood function ll(θ) so that the solution θ characterizes the probability distribution of the random variable X under consideration. Finding the θ that maximizes ll(θ) is the essence of maximum likelihood estimation. The corresponding variance-covariance matrix of θ is derived from the information matrix (the negative of the expected value of the second derivatives) of the log-likelihood function as follows:

Var(θ) = [-E(∂^2ll/∂θ∂θ')]^-1

Probability Distributions in the Exponential Family

Probability distributions in the exponential family are popular in econometrics. For continuous data, the familiar example is the likelihood function derived from a normal or log-normal probability distribution. For binary data analysis, the Bernoulli or binomial distribution is used. To analyze accident (count) data, we typically assume a Poisson distribution. For survival data analysis (e.g., unemployment or strike duration), an inverse transformation of the random variable, and therefore an inverse probability distribution, is applied.

Normal Distribution

We begin with the familiar normal probability distribution:

f(X,θ) = 1/√(2πσ^2) exp[-(X-μ)^2/(2σ^2)]

where θ = (μ,σ^2) represents the distribution parameters. It is straightforward to show that the maximum likelihood solution is μ = E(X) = 1/N ∑i=1,...,N Xi, the sample mean, and σ^2 = Var(X) = 1/N ∑i=1,...,N (Xi-μ)^2, the sample variance. The corresponding log-likelihood function is:

ll(θ) = ∑i=1,...,N ln f(Xi,θ) = -(N/2) ln(2πσ^2) - ½ ∑i=1,...,N (Xi-μ)^2/σ^2
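A minimal sketch of this estimation in Python, minimizing the negative log-likelihood with scipy.optimize.minimize; the data here are simulated, standing in for an actual sample such as YED20.TXT:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=20)    # simulated sample

# Negative of the normal log-likelihood ll(theta) above
def neg_ll(theta):
    mu, sigma2 = theta
    if sigma2 <= 0:
        return np.inf
    return 0.5*len(X)*np.log(2*np.pi*sigma2) + 0.5*np.sum((X - mu)**2)/sigma2

res = minimize(neg_ll, x0=[X.mean(), X.var()], method='BFGS')
print(res.x)         # MLE of (mu, sigma^2): the sample mean and variance
print(res.hess_inv)  # approximates Var(theta) = [-E(d2ll/dθdθ')]^-1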

Log-Normal Distribution

Another familiar example is based on the log-normal distribution of X (or equivalently, the normal distribution of ln(X)), defined as:

f(X,θ) = 1/√(2πσ^2) (1/X) exp[-(ln(X)-μ)^2/(2σ^2)]

With the solution μ = 1/N ∑i=1,...,N ln(Xi) and σ^2 = 1/N ∑i=1,...,N (ln(Xi)-μ)^2, the corresponding mean and variance of X are E(X) = exp(μ+σ^2/2) and Var(X) = exp(2μ+σ^2)[exp(σ^2)-1], respectively. Many economic variables are described by a log-normal rather than a normal probability distribution.

Gamma Distribution

Of course, maximum likelihood estimation is not limited to models with a normal or log-normal probability distribution. In many situations, the probability distribution of a random variable may be non-normal or unknown. For example, for a nonnegative random variable X ≥ 0 with a gamma distribution, the density function is

f(X,θ) = λ^ρ/Γ(ρ) e^(-λX) X^(ρ-1)

where θ = (λ,ρ) is the parameter vector with λ > 0 and ρ > 0. The mean of X is ρ/λ, and the variance is ρ/λ^2. Many familiar distributions, such as the exponential (ρ=1) and chi-square (ρ=N/2, λ=1/2) distributions, are special cases of the gamma distribution.

As with the normal distribution, the technique of maximum likelihood can be used to estimate the parameters of the gamma distribution. For a sample of N independent observations from the gamma distribution, the log-likelihood function is:

ll(θ) = N [ρ ln(λ) - ln Γ(ρ)] - λ ∑i=1,...,N Xi + (ρ-1) ∑i=1,...,N ln(Xi).
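A sketch of maximizing this log-likelihood in Python, with scipy.special.gammaln supplying ln Γ(ρ); the simulated sample and starting values are arbitrary choices:

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

X = np.random.default_rng(1).gamma(shape=2.0, scale=1.5, size=100)  # simulated

# Negative gamma log-likelihood in theta = (lambda, rho)
def neg_ll(theta):
    lam, rho = theta
    if lam <= 0 or rho <= 0:
        return np.inf
    N = len(X)
    return -(N*(rho*np.log(lam) - gammaln(rho))
             - lam*X.sum() + (rho - 1)*np.log(X).sum())

res = minimize(neg_ll, x0=[1.0, 1.0], method='Nelder-Mead')
print(res.x)    # (lambda, rho); then E(X) = rho/lambda, Var(X) = rho/lambda^2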

Inverse Gamma Distribution

The inverse gamma distribution is the distribution of Y = 1/X, where X has the gamma distribution. It is defined by:

f(Y,θ) = λ^ρ/Γ(ρ) e^(-λ/Y) Y^(-(ρ+1))

where θ = (λ,ρ) is the parameter vector with ρ > 0 and λ > 0.

Inverse Normal Distribution (Wald Distribution)

The standard form of inverse normal distribution for the random variable X is:

f(X,θ) = √(λ/(2πX^3)) exp[-λ(X-μ)^2/(2μ^2 X)]

where θ = (μ,λ) is the vector of distribution parameters for X > 0, λ > 0, and μ > 0. The maximum likelihood estimators are μ = 1/N ∑i=1,...,N Xi and 1/λ = 1/N ∑i=1,...,N (1/Xi - 1/μ). The mean and variance of X are E(X) = μ and Var(X) = μ^3/λ, respectively.

With the normal, log-normal, and gamma (or exponential) probability distributions, the characteristics of the random variable X may be described in terms of the estimated mean and variance for each probability distribution as follows:

                   Normal   Log-Normal                Gamma    Exponential
Mean E(X)          μ        exp(μ+σ^2/2)              ρ/λ      1/λ
Variance Var(X)    σ^2      exp(2μ+σ^2)[exp(σ^2)-1]   ρ/λ^2    1/λ^2

where μ = 1/N ∑i=1,...,N Xi and σ^2 = 1/N ∑i=1,...,N (Xi-μ)^2 for the normal distribution; μ = 1/N ∑i=1,...,N ln(Xi) and σ^2 = 1/N ∑i=1,...,N (ln(Xi)-μ)^2 for the log-normal distribution.

Based on 20 observations of the hypothetical data series INCOME of Greene [Table C.1] or YED20.TXT, estimate its mean and variance under each of three probability distributions: normal, log-normal, and gamma. First, define the log-likelihood function for each probability distribution. Then estimate the parameters of each distribution by maximizing the corresponding log-likelihood function.

Example 3: Mixture of Probability Functions

It is possible that a random variable is drawn from a mixture of probability distributions (two or more, of the same or different types). As a simple case, consider X distributed as a mixture of two normal distributions:

f1(X,μ1,σ1) = 1/√(2πσ1^2) exp[-(X-μ1)^2/(2σ1^2)] and
f2(X,μ2,σ2) = 1/√(2πσ2^2) exp[-(X-μ2)^2/(2σ2^2)]

Then the likelihood function is

f(X,θ) = λ f1(X,μ1,σ1) + (1-λ) f2(X,μ2,σ2)

where λ is the probability that an observation is drawn from the first distribution f1(X,μ1,σ1), and 1-λ is the probability that it is drawn from the second. θ = (μ1,σ1,μ2,σ2,λ)' is the unknown parameter vector to be estimated.

Continuing from the previous example, suppose each observation of the variable INCOME is drawn from one of two different normal distributions. Formulate and maximize the log-likelihood function to solve for the parameters of the two normal distributions.
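A minimal sketch of this estimation in Python; simulated data stand in for INCOME, and the starting values are arbitrary (mixture likelihoods are sensitive to them):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
INCOME = np.concatenate([rng.normal(0.0, 1.0, 60),    # draws from f1
                         rng.normal(5.0, 2.0, 40)])   # draws from f2

# Negative log-likelihood of the two-normal mixture,
# theta = (mu1, sigma1, mu2, sigma2, lambda)
def neg_ll(theta):
    mu1, s1, mu2, s2, lam = theta
    if s1 <= 0 or s2 <= 0 or not 0 < lam < 1:
        return np.inf
    mix = lam*norm.pdf(INCOME, mu1, s1) + (1 - lam)*norm.pdf(INCOME, mu2, s2)
    return -np.sum(np.log(mix))

res = minimize(neg_ll, x0=[0.0, 1.0, 4.0, 1.0, 0.5], method='Nelder-Mead')
print(res.x)    # estimates of (mu1, sigma1, mu2, sigma2, lambda)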

Example 4: Estimating Regression Equation

Using the above hypothetical data series INCOME and EDUCATION, we estimate the effect of EDUCATION on INCOME with a linear parametrization of the mean, μ = E(INCOME) = α + β EDUCATION. The linear regression equation is then INCOME = α + β EDUCATION + ε, where the error term ε is assumed to follow a normal probability distribution with zero mean and constant variance σ^2.

The method of maximum likelihood estimation is to find the parameter vector θ = (α,β,σ^2) so that the following log-likelihood function

ll(θ) = ∑i=1,...,N ln f(εi,θ) = -(N/2) ln(2πσ^2) - ½ ∑i=1,...,N εi^2/σ^2
= -(N/2) ln(2πσ^2) - ½ ∑i=1,...,N (INCOMEi - α - β EDUCATIONi)^2/σ^2

is maximized.

Without assuming the normal probability distribution for the regression error, the alternative method of least squares estimation is to find the parameter vector (α,β) by minimizing the sum of squared errors

S(α,β) = ε'ε = ∑i=1,...,N εi^2 = ∑i=1,...,N (INCOMEi - α - β EDUCATIONi)^2
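A sketch of both estimators in Python; the INCOME and EDUCATION series here are simulated, not Greene's data. Under normality, the maximum likelihood estimates of α and β coincide with the least squares estimates.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
EDUCATION = rng.integers(8, 21, size=20).astype(float)          # simulated
INCOME = 5.0 + 2.0*EDUCATION + rng.normal(0.0, 3.0, size=20)    # simulated

# Negative log-likelihood in theta = (alpha, beta, sigma^2)
def neg_ll(theta):
    a, b, sigma2 = theta
    if sigma2 <= 0:
        return np.inf
    e = INCOME - a - b*EDUCATION
    return 0.5*len(e)*np.log(2*np.pi*sigma2) + 0.5*np.sum(e**2)/sigma2

res = minimize(neg_ll, x0=[0.0, 1.0, 1.0], method='Nelder-Mead')
print(res.x)                               # ML estimates (alpha, beta, sigma^2)
print(np.polyfit(EDUCATION, INCOME, 1))    # least squares returns (beta, alpha)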

Constrained Optimization

In many economic and econometric applications, the optimization problems are likely to be constrained, with equality constraints, inequality constraints, or both. Although equality constraints in simpler situations can be substituted into the objective function through reparametrization, inequality constraints, which may include a set of boundary conditions on the estimated parameters, are more difficult to handle.

Equality Constraint

The equality constrained optimization problem is represented as follows:

maximize (or minimize) f(x)
subject to c(x) = 0

In terms of the Lagrangian Function defined by

L(x,λ) = f(x) - λ' c(x),

it becomes an unconstrained optimization problem in x and λ, where λ is a vector of Lagrangian multipliers. For a binding or active constraint, the corresponding element of λ must be non-zero.

If x0 is a maximum (or minimum), the first-order conditions are

∂f(x0)/∂x - λ' ∂c(x0)/∂x = 0
c(x0) = 0

and the second-order (bordered Hessian) condition is stated as:

∂^2L(x0,λ)/∂x∂x' is negative (or positive) definite on the subspace tangent to c(x) = 0 at x0. This condition is checked through the bordered Hessian matrix

[ ∂^2L(x0,λ)/∂x∂x'   ∂c(x0)/∂x' ]
[ ∂c(x0)/∂x          0          ]

Inequality Constraint

A typical inequality constrained optimization problem is given as:

maximize f(x) subject to c(x) ≤ 0, or
minimize f(x) subject to c(x) ≥ 0.

If x0 is a maximum (or minimum) and λ is the vector of Lagrangian multipliers, the first-order conditions for maximizing (or minimizing) the Lagrangian function L(x,λ) = f(x) - λ' c(x), known as the Kuhn-Tucker conditions, are:

∂f(x0)/∂x - λ' ∂c(x0)/∂x = 0
c(x0) ≤ 0 for maximization (or c(x0) ≥ 0 for minimization), and λ ≥ 0.

The last set of inequality conditions implies that λi ci(x0) = 0 for each constraint i. In other words, an active constraint i, with ci(x0) = 0, has a corresponding Lagrangian multiplier λi > 0.

A general optimization problem may consist of both equality and inequality constraints. The same representation can be used if the vector of inequality constraints is extended to include the equalities. For numerical computation of the constrained solution, equality constraints, inequality constraints, and bounded-set restrictions are treated separately for efficiency reasons.

Special Cases

  1. Nonnegativity Constraints

    Consider a special problem of bounded minimization:

    minimize f(x) subject to x ≥ 0.

    The solution is found from the following first-order (Kuhn-Tucker) conditions:

    ∂f(x0)/∂x ≥ 0, x0 ≥ 0, and x0i ∂f(x0)/∂xi = 0 for each element i.

  2. Linear Programming

    A classical LP problem is formulated like this:

    minimize f(x) = p'x
    subject to Ax - b ≥ 0, and x ≥ 0.

    where the vectors p, b, and the matrix A are the parameters conformable with x.

  3. Quadratic Programming

    A typical QP problem is given as follows:

    minimize f(x) = p'x + ½ x'Qx
    subject to Ax - b ≥ 0, and x ≥ 0.

    where the vectors p, b, and the matrices Q, A, are the parameters conformable with x. A numerical sketch follows below.
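Such problems can also be solved numerically with an off-the-shelf routine. A sketch using scipy.optimize.minimize with the SLSQP method; the particular p, Q, A, and b are arbitrary illustrations:

import numpy as np
from scipy.optimize import minimize

# min p'x + 1/2 x'Qx  subject to  Ax - b >= 0 and x >= 0
p = np.array([-2.0, -5.0])
Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])
A = np.array([[-1.0, -2.0]])    # Ax - b >= 0 encodes x1 + 2 x2 <= 6
b = np.array([-6.0])

res = minimize(lambda x: p @ x + 0.5*(x @ Q @ x),
               x0=np.zeros(2), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda x: A @ x - b}],
               bounds=[(0, None), (0, None)])
print(res.x)    # -> approximately (1.0, 2.5); the constraint is just binding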

Using Unconstrained Optimization Methods

Using GAUSS, constrained optimization problems can be solved with the built-in procedure SQPSOLVE (similarly, the QNEWTON procedure is useful for solving unconstrained problems). However, only the BFGS descent algorithm is available in both GAUSS procedures.

By substituting out (linear or nonlinear) constraints, we can use unconstrained optimization methods to solve a constrained problem. For inequality constraints, a parameter transformation may be used. For example, instead of estimating the constrained parameter x, we estimate the unconstrained parameter z and compute x from a continuous transformation x = φ(z), -∞ < z < ∞, as follows:

Parameter Constraint    Parameter Transformation x = φ(z)
x > 0                   x = z^2 or x = exp(z)
0 < x < 1               x = exp(z)/(1+exp(z))
-1 < x < 1              x = (exp(z)-1)/(1+exp(z)) = tanh(z/2)
x1 > 0, x2 > 0,         x1 = exp(z1)/(1+exp(z1)+exp(z2)) and
0 < x1 + x2 < 1         x2 = exp(z2)/(1+exp(z1)+exp(z2))

Given the transformation function x = φ(z), the objective function f(x) is transformed into F(z) = f(φ(z)). Therefore, constrained optimization of f with respect to x is equivalent to unconstrained optimization of F with respect to z. However, the parameter of interest is x, not z, so it is useful to recover the gradient and Hessian of f with respect to x.

From the unconstrained solution z of F, the associated gradient and Hessian are:

∂F/∂z = (∂f/∂x)(∂φ/∂z) = 0
∂^2F/∂z∂z' = (∂f/∂x)[∂^2φ/∂z∂z'] + (∂φ/∂z)' [∂^2f/∂x∂x'] (∂φ/∂z)

Therefore,

∂f/∂x = (∂F/∂z)(∂φ/∂z)^-1 = 0, because ∂F/∂z = 0
∂^2f/∂x∂x' = (∂φ/∂z)'^-1 [∂^2F/∂z∂z'] (∂φ/∂z)^-1

Note that the transformation function φ: z -> x may be vector-valued, in which case the gradient ∂φ/∂z is a matrix whose inverse is used to transform the Hessians.
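A scalar sketch of this recovery in Python for the case 0 < x < 1; z_hat and var_z are placeholders for the output of an unconstrained optimizer, and the variance transformation is the scalar case of the Hessian formula above:

import numpy as np

def phi(z):                       # logistic transform: maps R into (0,1)
    return np.exp(z)/(1.0 + np.exp(z))

def dphi(z):                      # its derivative dphi/dz
    return phi(z)*(1.0 - phi(z))

z_hat, var_z = 0.8, 0.25          # placeholder unconstrained estimates
x_hat = phi(z_hat)                # the parameter of interest
var_x = dphi(z_hat)**2 * var_z    # Var(x) = (dphi/dz) Var(z) (dphi/dz)'
print(x_hat, var_x)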

Example 3 Revisited: Mixture of Probability Distributions

The mixture probability λ must satisfy the condition 0 < λ < 1. To impose the unit-interval restriction on λ, we could apply a transformation such as λ = exp(θ)/(1+exp(θ)), where -∞ < θ < ∞ is unrestricted. Since our interest is the probability parameter λ, the estimates of the parameter variances should be obtained for λ, not θ.


Copyright © Kuan-Pin Lin
Last updated: 09/28/2012