[R] optim or nlminb for minimization, which to believe?
Ravi Varadhan
rvaradhan at jhmi.edu
Tue Dec 1 03:42:30 CET 2009
Hi Harold,
Hans is right. You can see this if you use BB::spg
require(BB)
opt2 <- spg(startVal, fn) # this is fine
opt3 <- spg(startVal, fn,gradient) # this is not fine!
> opt3 <- spg(startVal, fn, gradient)
Gradient check details: max. relative difference in gradients= 0.001697913
analytic gradient: 0.1724284 -0.3382045 0.1724284 -0.3382045 -0.4902559
numerical gradient: 0.2826690 -0.2569404 0.2826690 -0.2569404 -0.4223231
Error in spg(startVal, fn, gradient) :
Analytic gradient does not seem correct! See comparison above. Fix it, remove it, or increase checkGrad.tol.
>
Ravi.
____________________________________________________________________
Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University
Ph. (410) 502-2619
email: rvaradhan at jhmi.edu
----- Original Message -----
From: Hans W Borchers <hwborchers at googlemail.com>
Date: Sunday, November 29, 2009 6:18 pm
Subject: Re: [R] optim or nlminb for minimization, which to believe?
To: r-help at r-project.org
> Your function named 'gradient' is not the correct gradient. Take as an
> example the following point x0, very near to the true minimum,
>
> x0 <- c(-0.2517964, 0.4898680, -0.2517962, 0.4898681, 0.7500995)
>
> then you get
>
> > gradient(x0)
> [1] -0.0372110470 0.0001816991 -0.0372102284 0.0001820976
> 0.0144292657
>
> but the numerical gradient is different:
>
> > library(numDeriv)
> > grad(fn, x0)
> [1] -6.151645e-07 -5.507219e-07 1.969143e-07 -1.563892e-07
> -4.955502e-08
>
> that is the derivative is close to 0 in any direction -- as it should
> be for
> an optimum.
>
> No wonder, optim et al. get confused when applying this 'gradient'.
>
> Regards
> Hans Werner
>
>
> Doran, Harold wrote:
> >
> > I have constructed the function mml2 (below) based on the likelihood
> > function described in the minimal latex I have pasted below for
> anyone who
> > wants to look at it. This function finds parameter estimates for a basic
> > Rasch (IRT) model. Using the function without the gradient, using either
> > nlminb or optim returns the correct parameter estimates and, in the
> case
> > of optim, the correct standard errors.
> >
> > By correct, I mean they match another software program as well as the
> > rasch() function in the ltm package.
> >
> > Now, when I pass the gradient to optim, I get a message of successful
> > convergence, but the parameter estimates are not correct, but they are
> > *very* close to being correct. But, when I use nlminb with the
> gradient, I
> > get a message of false convergence and again, the estimates are off,
> but
> > again very close to being correct.
> >
> > This is best illustrated via the examples:
> >
> > ### Sample data set
> > set.seed(1234)
> > tmp <- data.frame(item1 = sample(c(0,1), 20, replace=TRUE), item2 =
> > sample(c(0,1), 20, replace=TRUE), item3 = sample(c(0,1), 20,
> > replace=TRUE),item4 = sample(c(0,1), 20, replace=TRUE),item5 =
> > sample(c(0,1), 20, replace=TRUE))
> >
> > ## Use function mml2 (below) with optim with use of gradient
> >> mml2(tmp,Q=10)
> > $par
> > [1] -0.2438733 0.4889333 -0.2438733 0.4889333 0.7464162
> >
> > $value
> > [1] 63.86376
> >
> > $counts
> > function gradient
> > 45 6
> >
> > $convergence
> > [1] 0
> >
> > $message
> > NULL
> >
> > $hessian
> > [,1] [,2] [,3] [,4] [,5]
> > [1,] 4.095479 0.000000 0.000000 0.000000 0.000000
> > [2,] 0.000000 3.986293 0.000000 0.000000 0.000000
> > [3,] 0.000000 0.000000 4.095479 0.000000 0.000000
> > [4,] 0.000000 0.000000 0.000000 3.986293 0.000000
> > [5,] 0.000000 0.000000 0.000000 0.000000 3.800898
> >
> > ## Use same function but use nlminb with use of gradient
> >> mml2(tmp,Q=10)
> > $par
> > [1] -0.2456398 0.4948889 -0.2456398 0.4948889 0.7516308
> >
> > $objective
> > [1] 63.86364
> >
> > $convergence
> > [1] 1
> >
> > $message
> > [1] "false convergence (8)"
> >
> > $iterations
> > [1] 4
> >
> > $evaluations
> > function gradient
> > 41 4
> >
> >
> > ### use nlminb but turn off use of gradient
> >> mml2(tmp,Q=10)
> > $par
> > [1] -0.2517961 0.4898682 -0.2517961 0.4898682 0.7500994
> >
> > $objective
> > [1] 63.8635
> >
> > $convergence
> > [1] 0
> >
> > $message
> > [1] "relative convergence (4)"
> >
> > $iterations
> > [1] 8
> >
> > $evaluations
> > function gradient
> > 11 64
> >
> > ### Use optim and turn off gradient
> >
> >> mml2(tmp,Q=10)
> > $par
> > [1] -0.2517990 0.4898676 -0.2517990 0.4898676 0.7500906
> >
> > $value
> > [1] 63.8635
> >
> > $counts
> > function gradient
> > 22 7
> >
> > $convergence
> > [1] 0
> >
> > $message
> > NULL
> >
> > $hessian
> > [,1] [,2] [,3] [,4] [,5]
> > [1,] 3.6311153 -0.3992959 -0.4224747 -0.3992959 -0.3764526
> > [2,] -0.3992959 3.5338195 -0.3992959 -0.3960956 -0.3798141
> > [3,] -0.4224747 -0.3992959 3.6311153 -0.3992959 -0.3764526
> > [4,] -0.3992959 -0.3960956 -0.3992959 3.5338195 -0.3798141
> > [5,] -0.3764526 -0.3798141 -0.3764526 -0.3798141 3.3784816
> >
> > The parameter estimates with and without the use of the gradient are
> so
> > close that I am inclined to believe that the gradient is correct and
> maybe
> > the problem is elsewhere.
> >
> > It seems odd that optim seems to converge but nlminb does not with
> the use
> > of the gradient. But, with the use of the gradient in either case, the
> > parameter estimates differ from what I think are the correct values.
> So,
> > at this point I am unclear if the problem is somewhere in the way the
> > functions are used or how I am passing the gradient or if the
> problem lies
> > in the way I have constructed the gradient itself.
> >
> > Below is the function and also some latex for those interested in looking
> > at the likelihood function.
> >
> > Thanks for any reactions
> > Harold
> >
> >> sessionInfo()
> > R version 2.10.0 (2009-10-26)
> > i386-pc-mingw32
> >
> > locale:
> > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> > States.1252
> > [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> > [5] LC_TIME=English_United States.1252
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > other attached packages:
> > [1] ltm_0.9-2 polycor_0.7-7 sfsmisc_1.0-9 mvtnorm_0.9-8
> msm_0.9.4
> > MASS_7.3-3 MiscPsycho_1.5
> > [8] statmod_1.4.1
> >
> > loaded via a namespace (and not attached):
> > [1] splines_2.10.0 survival_2.35-7 tools_2.10.0
> >
> >
> > mml2 <- function(data, Q, startVal = NULL, gr = TRUE, ...){
> > if(!is.null(startVal) && length(startVal) !=
> ncol(data) ){
> > stop("Length of argument startVal not
> > equal to the number of parameters estimated")
> > }
> > if(!is.null(startVal)){
> > startVal <- startVal
> > } else {
> > p <- colMeans(data)
> > startVal <- as.vector(log((1 - p)/p))
> > }
> > qq <- gauss.quad.prob(Q, dist = 'normal')
> > rr1 <- matrix(0, nrow = Q, ncol = nrow(data))
> > data <- as.matrix(data)
> > L <- nrow(data)
> > C <- ncol(data)
> > fn <- function(b){
> > for(j in 1:Q){
> > for(i in 1:nrow(data)){
> > rr1[j,i]
> > <- exp(sum(dbinom(data[i,], 1, plogis(qq$nodes[j]-b), log = TRUE)))
> *
> >
> > qq$weights[j]
> > }
> > }
> > -sum(log(colSums(rr1)))
> > }
> > gradient <- function(b){
> > p <- outer(qq$nodes, b, plogis) * L
> > x <- t(matrix(colSums(data), nrow=
> C, Q))
> > w <- matrix(qq$weights, Q, C)
> > rr2 <- (-x + p) * w
> > -colSums(rr2)
> > }
> > opt <- optim(startVal, fn, gr = gradient, hessian =
> TRUE,
> > method = "BFGS")
> > #opt <- nlminb(startVal, fn, gradient=gradient)
> > #list("coefficients" = opt$par, "LogLik" = -opt$value,
> > "Std.Error" = 1/sqrt(diag(opt$hessian)))
> > opt
> > }
> >
> >
> >
> > ### latex describing the likelihood function
> > \documentclass{article}
> > \usepackage{latexsym}
> > \usepackage{amssymb,amsmath,bm}
> > \newcommand{\Prb}{\operatorname{Prob}}
> > \title{Marginal Maximum Likelihood for IRT Models}
> > \author{Harold C. Doran}
> > \begin{document}
> > \maketitle
> > The likelihood function is written as:
> > \begin{equation}
> > \label{eqn:mml}
> > f(x) =
> > \prod^N_{i=1}\int_{\Theta}\prod^K_{j=1}p(x|\theta,\beta)f(\theta)d\theta
> > \end{equation}
> > \noindent where $N$ indexes the total number of individuals, $K$ indexes
> > the total number of items, $p(x|\theta,\beta)$ is the data
> likelihood and
> > $f(\theta)$ is a population distribution. For the rasch model, the data
> > likelihood is:
> > \begin{equation}
> > p(x|\theta,\beta) = \prod^N_{i=1}\prod^K_{j=1} \Pr(x_{ij} = 1 | \theta_i,
> > \beta_j)^{x_{ij}} \left[1 - \Pr(X_{ij} = 1 | \theta_i,
> > \beta_j)\right]^{(1-x_{ij})}
> > \end{equation}
> > \begin{equation}
> > \label{rasch}
> > \Pr(x_{ij} = 1 | \theta_i, \beta_j) = \frac{1}{1 +
> > e^{-(\theta_i-\beta_j)}} \quad i = (1, \ldots, K); j = (1, \ldots, N)
> > \end{equation}
> > \noindent where $\theta_i$ is the ability estimate of person $i$ and
> > $\beta_j$ is the location parameter for item $j$. The integral in
> > Equation~(\ref{eqn:mml}) has no closed form expression, so it is
> > approximated using Gauss-Hermite quadrature as:
> > \begin{equation}
> > \label{eqn:mml:approx}
> > f(x) \approx
> > \prod^N_{i=1}\sum_{q=1}^{Q}\prod^K_{j=1}p(x|\theta_q,\beta)w_q
> > \end{equation}
> > \noindent where $q$ indexes the node at quadrature point $q$ and $w$
> is
> > the weight at quadrature point $q$. With Equation~(\ref{eqn:mml:approx}),
> > the remaining challenge is to find $\underset{x}{\operatorname{arg\,max}}
> > \, f(x)$.
> > \end{document}
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> >
> > PLEASE do read the posting guide
> >
> > and provide commented, minimal, self-contained, reproducible code.
> >
> >
>
> --
> View this message in context:
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
>
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list