[R] optim or nlminb for minimization, which to believe?
Hans W Borchers
hwborchers at googlemail.com
Mon Nov 30 00:16:56 CET 2009
Your function named 'gradient' is not the correct gradient. Take as an
example the following point x0, very near to the true minimum,
x0 <- c(-0.2517964, 0.4898680, -0.2517962, 0.4898681, 0.7500995)
then you get
> gradient(x0)
[1] -0.0372110470  0.0001816991 -0.0372102284  0.0001820976  0.0144292657
but the numerical gradient is different:
> library(numDeriv)
> grad(fn, x0)
[1] -6.151645e-07 -5.507219e-07  1.969143e-07 -1.563892e-07 -4.955502e-08
that is, the derivative is close to 0 in every direction, as it should be
at an optimum.
No wonder optim et al. get confused when applying this 'gradient'.
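A quick way to catch such a mismatch early is to compare the analytic
gradient against a numerical one at a few random points before handing it
to an optimizer. A minimal sketch, assuming 'fn' and 'gradient' have been
pulled out of mml2 below so that 'qq', 'data', 'L', and 'C' are in scope:

library(numDeriv)
set.seed(42)
for (k in 1:5) {
    b <- rnorm(5)                              # random test point
    d <- max(abs(gradient(b) - grad(fn, b)))   # largest componentwise gap
    cat("point", k, ": max |analytic - numeric| =", d, "\n")
}

If the gap does not shrink toward rounding error, the analytic gradient is
wrong, whatever the optimizer reports.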
Regards
Hans Werner
Doran, Harold wrote:
>
> I have constructed the function mml2 (below) based on the likelihood
> function described in the minimal latex I have pasted below for anyone who
> wants to look at it. This function finds parameter estimates for a basic
> Rasch (IRT) model. Using the function without the gradient, using either
> nlminb or optim returns the correct parameter estimates and, in the case
> of optim, the correct standard errors.
>
> By correct, I mean they match another software program as well as the
> rasch() function in the ltm package.
>
> Now, when I pass the gradient to optim, I get a message of successful
> convergence, but the parameter estimates are not correct; they are,
> however, *very* close. When I use nlminb with the gradient, I get a
> message of false convergence and, again, the estimates are off but very
> close to the correct values.
>
> This is best illustrated via the examples:
>
> ### Sample data set
> set.seed(1234)
> tmp <- data.frame(item1 = sample(c(0,1), 20, replace = TRUE),
>                   item2 = sample(c(0,1), 20, replace = TRUE),
>                   item3 = sample(c(0,1), 20, replace = TRUE),
>                   item4 = sample(c(0,1), 20, replace = TRUE),
>                   item5 = sample(c(0,1), 20, replace = TRUE))
>
> ## Use function mml2 (below) with optim with use of gradient
>> mml2(tmp,Q=10)
> $par
> [1] -0.2438733 0.4889333 -0.2438733 0.4889333 0.7464162
>
> $value
> [1] 63.86376
>
> $counts
> function gradient
> 45 6
>
> $convergence
> [1] 0
>
> $message
> NULL
>
> $hessian
> [,1] [,2] [,3] [,4] [,5]
> [1,] 4.095479 0.000000 0.000000 0.000000 0.000000
> [2,] 0.000000 3.986293 0.000000 0.000000 0.000000
> [3,] 0.000000 0.000000 4.095479 0.000000 0.000000
> [4,] 0.000000 0.000000 0.000000 3.986293 0.000000
> [5,] 0.000000 0.000000 0.000000 0.000000 3.800898
>
> ## Use same function but use nlminb with use of gradient
>> mml2(tmp,Q=10)
> $par
> [1] -0.2456398 0.4948889 -0.2456398 0.4948889 0.7516308
>
> $objective
> [1] 63.86364
>
> $convergence
> [1] 1
>
> $message
> [1] "false convergence (8)"
>
> $iterations
> [1] 4
>
> $evaluations
> function gradient
> 41 4
>
>
> ### use nlminb but turn off use of gradient
>> mml2(tmp,Q=10)
> $par
> [1] -0.2517961 0.4898682 -0.2517961 0.4898682 0.7500994
>
> $objective
> [1] 63.8635
>
> $convergence
> [1] 0
>
> $message
> [1] "relative convergence (4)"
>
> $iterations
> [1] 8
>
> $evaluations
> function gradient
> 11 64
>
> ### Use optim and turn off gradient
>
>> mml2(tmp,Q=10)
> $par
> [1] -0.2517990 0.4898676 -0.2517990 0.4898676 0.7500906
>
> $value
> [1] 63.8635
>
> $counts
> function gradient
> 22 7
>
> $convergence
> [1] 0
>
> $message
> NULL
>
> $hessian
> [,1] [,2] [,3] [,4] [,5]
> [1,] 3.6311153 -0.3992959 -0.4224747 -0.3992959 -0.3764526
> [2,] -0.3992959 3.5338195 -0.3992959 -0.3960956 -0.3798141
> [3,] -0.4224747 -0.3992959 3.6311153 -0.3992959 -0.3764526
> [4,] -0.3992959 -0.3960956 -0.3992959 3.5338195 -0.3798141
> [5,] -0.3764526 -0.3798141 -0.3764526 -0.3798141 3.3784816
>
> The parameter estimates with and without the use of the gradient are so
> close that I am inclined to believe the gradient is correct and that the
> problem lies elsewhere.
>
> It seems odd that optim reports convergence but nlminb does not when the
> gradient is supplied. Yet in either case, with the gradient, the parameter
> estimates differ from what I believe are the correct values. So, at this
> point, I am unclear whether the problem is in the way the functions are
> called, in how I am passing the gradient, or in the way I have constructed
> the gradient itself.
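>
> One diagnostic would be (a sketch, assuming the numDeriv package is
> installed) to swap a purely numerical gradient into mml2 and see whether
> the estimates return to the no-gradient values:
>
> ## inside mml2, in place of the analytic gradient:
> gradient <- function(b) numDeriv::grad(fn, b)
> opt <- optim(startVal, fn, gr = gradient, hessian = TRUE, method = "BFGS")
>
> If they do, the analytic gradient is the culprit.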
>
> Below is the function and also some latex for those interested in looking
> at the likelihood function.
>
> Thanks for any reactions
> Harold
>
>> sessionInfo()
> R version 2.10.0 (2009-10-26)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ltm_0.9-2      polycor_0.7-7  sfsmisc_1.0-9  mvtnorm_0.9-8  msm_0.9.4      MASS_7.3-3     MiscPsycho_1.5
> [8] statmod_1.4.1
>
> loaded via a namespace (and not attached):
> [1] splines_2.10.0 survival_2.35-7 tools_2.10.0
>
>
> ## requires statmod for gauss.quad.prob()
> library(statmod)
>
> mml2 <- function(data, Q, startVal = NULL, gr = TRUE, ...){
>     if(!is.null(startVal) && length(startVal) != ncol(data)){
>         stop("Length of argument startVal not equal to the number of parameters estimated")
>     }
>     if(is.null(startVal)){
>         ## default start values from the item p-values
>         p <- colMeans(data)
>         startVal <- as.vector(log((1 - p)/p))
>     }
>     qq <- gauss.quad.prob(Q, dist = 'normal')
>     rr1 <- matrix(0, nrow = Q, ncol = nrow(data))
>     data <- as.matrix(data)
>     L <- nrow(data)
>     C <- ncol(data)
>     fn <- function(b){
>         ## marginal (quadrature) likelihood contribution for each person
>         for(j in 1:Q){
>             for(i in 1:nrow(data)){
>                 rr1[j, i] <- exp(sum(dbinom(data[i, ], 1,
>                     plogis(qq$nodes[j] - b), log = TRUE))) * qq$weights[j]
>             }
>         }
>         -sum(log(colSums(rr1)))
>     }
>     gradient <- function(b){
>         p <- outer(qq$nodes, b, plogis) * L
>         x <- t(matrix(colSums(data), nrow = C, ncol = Q))
>         w <- matrix(qq$weights, Q, C)
>         rr2 <- (-x + p) * w
>         -colSums(rr2)
>     }
>     opt <- optim(startVal, fn, gr = gradient, hessian = TRUE,
>                  method = "BFGS")
>     ## alternative call used in the runs above:
>     ## opt <- nlminb(startVal, fn, gradient = gradient)
>     ## list("coefficients" = opt$par, "LogLik" = -opt$value,
>     ##      "Std.Error" = 1/sqrt(diag(opt$hessian)))
>     opt
> }
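>
> ## For reference, a sketch of the comparison mentioned above (assuming
> ## the ltm package is installed): fit the same data with rasch(),
> ## constraining the discrimination parameter to 1 so the models match.
> library(ltm)
> fit <- rasch(tmp, constraint = cbind(ncol(tmp) + 1, 1))
> coef(fit)   # difficulty estimates to set against mml2(tmp, Q = 10)$par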
>
>
>
> ### latex describing the likelihood function
> \documentclass{article}
> \usepackage{latexsym}
> \usepackage{amssymb,amsmath,bm}
> \newcommand{\Prb}{\operatorname{Prob}}
> \title{Marginal Maximum Likelihood for IRT Models}
> \author{Harold C. Doran}
> \begin{document}
> \maketitle
> The likelihood function is written as:
> \begin{equation}
> \label{eqn:mml}
> f(x) =
> \prod^N_{i=1}\int_{\Theta}\prod^K_{j=1}p(x|\theta,\beta)f(\theta)d\theta
> \end{equation}
> \noindent where $N$ indexes the total number of individuals, $K$ indexes
> the total number of items, $p(x|\theta,\beta)$ is the data likelihood and
> $f(\theta)$ is a population distribution. For the rasch model, the data
> likelihood is:
> \begin{equation}
> p(x|\theta,\beta) = \prod^N_{i=1}\prod^K_{j=1} \Pr(x_{ij} = 1 | \theta_i,
> \beta_j)^{x_{ij}} \left[1 - \Pr(x_{ij} = 1 | \theta_i,
> \beta_j)\right]^{(1-x_{ij})}
> \end{equation}
> \begin{equation}
> \label{rasch}
> \Pr(x_{ij} = 1 | \theta_i, \beta_j) = \frac{1}{1 +
> e^{-(\theta_i-\beta_j)}} \quad i = 1, \ldots, N;\; j = 1, \ldots, K
> \end{equation}
> \noindent where $\theta_i$ is the ability estimate of person $i$ and
> $\beta_j$ is the location parameter for item $j$. The integral in
> Equation~(\ref{eqn:mml}) has no closed form expression, so it is
> approximated using Gauss-Hermite quadrature as:
> \begin{equation}
> \label{eqn:mml:approx}
> f(x) \approx
> \prod^N_{i=1}\sum_{q=1}^{Q}\prod^K_{j=1}p(x|\theta_q,\beta)w_q
> \end{equation}
> \noindent where $\theta_q$ is the node and $w_q$ the weight at quadrature
> point $q$. With Equation~(\ref{eqn:mml:approx}), the remaining challenge
> is to find $\underset{\beta}{\operatorname{arg\,max}} \, f(x)$.
> \end{document}
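>
> For those following along in R: the integral above is what mml2
> approximates with statmod's gauss.quad.prob(). A quick sanity check of
> the quadrature rule (a sketch):
>
> library(statmod)
> qq <- gauss.quad.prob(10, dist = "normal")  # 10-point rule for N(0,1)
> sum(qq$weights)               # should be 1
> sum(qq$nodes * qq$weights)    # should be ~0 (mean of N(0,1))
> sum(qq$nodes^2 * qq$weights)  # should be ~1 (variance of N(0,1))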