[R] optim or nlminb for minimization, which to believe?

Tue Dec 1 03:42:30 CET 2009

Hi Harold,

Hans is right.  You can see this if you use BB::spg

require(BB)

opt2 <- spg(startVal, fn)  # this is fine

opt3 <- spg(startVal, fn,gradient)  # this is not fine!

> opt3 <- spg(startVal, fn, gradient) 
Gradient check details:  max. relative difference in gradients=  0.001697913 

  analytic gradient: 0.1724284 -0.3382045 0.1724284 -0.3382045 -0.4902559 

  numerical gradient: 0.2826690 -0.2569404 0.2826690 -0.2569404 -0.4223231
Error in spg(startVal, fn, gradient) : 
  Analytic gradient does not seem correct! See comparison above. Fix it, remove it, or increase checkGrad.tol.
>

Ravi.

____________________________________________________________________

Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvaradhan at jhmi.edu

----- Original Message -----
From: Hans W Borchers <hwborchers at googlemail.com>
Date: Sunday, November 29, 2009 6:18 pm
Subject: Re: [R] optim or nlminb for minimization, which to believe?
To: r-help at r-project.org

> Your function named 'gradient' is not the correct gradient. Take as an
> example the following point x0, very near to the true minimum,
> 
>     x0 <- c(-0.2517964, 0.4898680, -0.2517962, 0.4898681, 0.7500995)
>  
> then you get
> 
>     > gradient(x0)
>     [1] -0.0372110470  0.0001816991 -0.0372102284  0.0001820976 
> 0.0144292657
> 
> but the numerical gradient is different:
> 
>     > library(numDeriv)
>     > grad(fn, x0)
>     [1] -6.151645e-07 -5.507219e-07  1.969143e-07 -1.563892e-07
> -4.955502e-08
> 
> that is the derivative is close to 0 in any direction -- as it should 
> be for
> an optimum.
> 
> No wonder, optim et al. get confused when applying this 'gradient'.
> 
> Regards
> Hans Werner
> 
> 
> Doran, Harold wrote:
> > 
> > I have constructed the function mml2 (below) based on the likelihood
> > function described in the minimal latex I have pasted below for 
> anyone who
> > wants to look at it. This function finds parameter estimates for a basic
> > Rasch (IRT) model. Using the function without the gradient, using either
> > nlminb or optim returns the correct parameter estimates and, in the 
> case
> > of optim, the correct standard errors.
> > 
> > By correct, I mean they match another software program as well as the
> > rasch() function in the ltm package.
> > 
> > Now, when I pass the gradient to optim, I get a message of successful
> > convergence, but the parameter estimates are not correct, but they are
> > *very* close to being correct. But, when I use nlminb with the 
> gradient, I
> > get a message of false convergence and again, the estimates are off, 
> but
> > again very close to being correct.
> > 
> > This is best illustrated via the examples:
> > 
> > ### Sample data set
> > set.seed(1234)
> > tmp <- data.frame(item1 = sample(c(0,1), 20, replace=TRUE), item2 =
> > sample(c(0,1), 20, replace=TRUE), item3 = sample(c(0,1), 20,
> > replace=TRUE),item4 = sample(c(0,1), 20, replace=TRUE),item5 =
> > sample(c(0,1), 20, replace=TRUE))
> > 
> > ## Use function mml2 (below) with optim  with use of gradient
> >> mml2(tmp,Q=10)
> > $par
> > [1] -0.2438733  0.4889333 -0.2438733  0.4889333  0.7464162
> > 
> > $value
> > [1] 63.86376
> > 
> > $counts
> > function gradient
> >       45        6
> > 
> > $convergence
> > [1] 0
> > 
> > $message
> > NULL
> > 
> > $hessian
> >          [,1]     [,2]     [,3]     [,4]     [,5]
> > [1,] 4.095479 0.000000 0.000000 0.000000 0.000000
> > [2,] 0.000000 3.986293 0.000000 0.000000 0.000000
> > [3,] 0.000000 0.000000 4.095479 0.000000 0.000000
> > [4,] 0.000000 0.000000 0.000000 3.986293 0.000000
> > [5,] 0.000000 0.000000 0.000000 0.000000 3.800898
> > 
> > ## Use same function but use nlminb with use of gradient
> >> mml2(tmp,Q=10)
> > $par
> > [1] -0.2456398  0.4948889 -0.2456398  0.4948889  0.7516308
> > 
> > $objective
> > [1] 63.86364
> > 
> > $convergence
> > [1] 1
> > 
> > $message
> > [1] "false convergence (8)"
> > 
> > $iterations
> > [1] 4
> > 
> > $evaluations
> > function gradient
> >       41        4
> > 
> > 
> > ### use nlminb but turn off use of gradient
> >> mml2(tmp,Q=10)
> > $par
> > [1] -0.2517961  0.4898682 -0.2517961  0.4898682  0.7500994
> > 
> > $objective
> > [1] 63.8635
> > 
> > $convergence
> > [1] 0
> > 
> > $message
> > [1] "relative convergence (4)"
> > 
> > $iterations
> > [1] 8
> > 
> > $evaluations
> > function gradient
> >       11       64
> > 
> > ### Use optim and turn off gradient
> > 
> >> mml2(tmp,Q=10)
> > $par
> > [1] -0.2517990  0.4898676 -0.2517990  0.4898676  0.7500906
> > 
> > $value
> > [1] 63.8635
> > 
> > $counts
> > function gradient
> >       22        7
> > 
> > $convergence
> > [1] 0
> > 
> > $message
> > NULL
> > 
> > $hessian
> >            [,1]       [,2]       [,3]       [,4]       [,5]
> > [1,]  3.6311153 -0.3992959 -0.4224747 -0.3992959 -0.3764526
> > [2,] -0.3992959  3.5338195 -0.3992959 -0.3960956 -0.3798141
> > [3,] -0.4224747 -0.3992959  3.6311153 -0.3992959 -0.3764526
> > [4,] -0.3992959 -0.3960956 -0.3992959  3.5338195 -0.3798141
> > [5,] -0.3764526 -0.3798141 -0.3764526 -0.3798141  3.3784816
> > 
> > The parameter estimates with and without the use of the gradient are 
> so
> > close that I am inclined to believe that the gradient is correct and 
> maybe
> > the problem is elsewhere.
> > 
> > It seems odd that optim seems to converge but nlminb does not with 
> the use
> > of the gradient. But, with the use of the gradient in either case, the
> > parameter estimates differ from what I think are the correct values. 
> So,
> > at this point I am unclear if the problem is somewhere in the way the
> > functions are used or how I am passing the gradient or if the 
> problem lies
> > in the way I have constructed the gradient itself.
> > 
> > Below is the function and also some latex for those interested in looking
> > at the likelihood function.
> > 
> > Thanks for any reactions
> > Harold
> > 
> >> sessionInfo()
> > R version 2.10.0 (2009-10-26)
> > i386-pc-mingw32
> > 
> > locale:
> > [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
> > States.1252
> > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> > [5] LC_TIME=English_United States.1252
> > 
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> > 
> > other attached packages:
> > [1] ltm_0.9-2      polycor_0.7-7  sfsmisc_1.0-9  mvtnorm_0.9-8  
> msm_0.9.4     
> > MASS_7.3-3     MiscPsycho_1.5
> > [8] statmod_1.4.1
> > 
> > loaded via a namespace (and not attached):
> > [1] splines_2.10.0  survival_2.35-7 tools_2.10.0
> > 
> > 
> > mml2 <- function(data, Q, startVal = NULL, gr = TRUE, ...){
> >                 if(!is.null(startVal) && length(startVal) != 
> ncol(data) ){
> >                                 stop("Length of argument startVal not
> > equal to the number of parameters estimated")
> >                 }
> >                 if(!is.null(startVal)){
> >                                 startVal <- startVal
> >                                 } else {
> >                                 p <- colMeans(data)
> >                                 startVal <- as.vector(log((1 - p)/p))
> >                 }
> >                 qq <- gauss.quad.prob(Q, dist = 'normal')
> >                 rr1 <- matrix(0, nrow = Q, ncol = nrow(data))
> >                 data <- as.matrix(data)
> >                 L <- nrow(data)
> >                 C <- ncol(data)
> >                 fn <- function(b){
> >                                 for(j in 1:Q){
> >                                                 for(i in 1:nrow(data)){
> >                                                                 rr1[j,i]
> > <- exp(sum(dbinom(data[i,], 1, plogis(qq$nodes[j]-b), log = TRUE))) 
> *
> >                                                                
> > qq$weights[j]
> >                                                 }
> >                                 }
> >                 -sum(log(colSums(rr1)))
> >                 }
> >                 gradient <- function(b){
> >                                 p <- outer(qq$nodes, b, plogis) * L
> >                                 x <- t(matrix(colSums(data), nrow= 
> C, Q))
> >                                 w <- matrix(qq$weights, Q, C)
> >                                 rr2 <- (-x + p) * w
> >                                 -colSums(rr2)
> >                 }
> >                 opt <- optim(startVal, fn, gr = gradient, hessian = 
> TRUE,
> > method = "BFGS")
> >                 #opt <- nlminb(startVal, fn, gradient=gradient)
> >                 #list("coefficients" = opt$par, "LogLik" = -opt$value,
> > "Std.Error" = 1/sqrt(diag(opt$hessian)))
> >                 opt
> > }
> > 
> > 
> > 
> > ### latex describing the likelihood function
> > \documentclass{article}
> > \usepackage{latexsym}
> > \usepackage{amssymb,amsmath,bm}
> > \newcommand{\Prb}{\operatorname{Prob}}
> > \title{Marginal Maximum Likelihood for IRT Models}
> > \author{Harold C. Doran}
> > \begin{document}
> > \maketitle
> > The likelihood function is written as:
> > \begin{equation}
> > \label{eqn:mml}
> > f(x) =
> > \prod^N_{i=1}\int_{\Theta}\prod^K_{j=1}p(x|\theta,\beta)f(\theta)d\theta
> > \end{equation}
> > \noindent where $N$ indexes the total number of individuals, $K$ indexes
> > the total number of items, $p(x|\theta,\beta)$ is the data 
> likelihood and
> > $f(\theta)$ is a population distribution. For the rasch model, the data
> > likelihood is:
> > \begin{equation}
> > p(x|\theta,\beta) = \prod^N_{i=1}\prod^K_{j=1} \Pr(x_{ij} = 1 | \theta_i,
> > \beta_j)^{x_{ij}} \left[1 - \Pr(X_{ij} = 1 | \theta_i,
> > \beta_j)\right]^{(1-x_{ij})}
> > \end{equation}
> > \begin{equation}
> > \label{rasch}
> > \Pr(x_{ij} = 1 | \theta_i, \beta_j) = \frac{1}{1 +
> > e^{-(\theta_i-\beta_j)}} \quad i = (1, \ldots, K); j = (1, \ldots, N)
> > \end{equation}
> > \noindent where $\theta_i$ is the ability estimate of person $i$ and
> > $\beta_j$ is the location parameter for item $j$. The integral in
> > Equation~(\ref{eqn:mml}) has no closed form expression, so it is
> > approximated using Gauss-Hermite quadrature as:
> > \begin{equation}
> > \label{eqn:mml:approx}
> > f(x) \approx
> > \prod^N_{i=1}\sum_{q=1}^{Q}\prod^K_{j=1}p(x|\theta_q,\beta)w_q
> > \end{equation}
> > \noindent where $q$ indexes the node at quadrature point $q$ and $w$ 
> is
> > the weight at quadrature point $q$. With Equation~(\ref{eqn:mml:approx}),
> > the remaining challenge is to find $\underset{x}{\operatorname{arg\,max}}
> > \, f(x)$.
> > \end{document}
> > 
> > 	[[alternative HTML version deleted]]
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > 
> > PLEASE do read the posting guide
> > 
> > and provide commented, minimal, self-contained, reproducible code.
> > 
> > 
> 
> -- 
> View this message in context: 
> Sent from the R help mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> 
> PLEASE do read the posting guide 
> and provide commented, minimal, self-contained, reproducible code.