[R] performance gap between R 1.7.1 and 1.8.0

Fri Dec 5 09:11:09 CET 2003

If you look at the by.total components what you get is that for 1.7.1

>  summaryRprof("Rprof.out.1.7.1")$by.total[1:30,]
                   total.time total.pct self.time self.pct
"boot"                   1.72     100.0      0.00      0.0
...
"validObject"            0.14       8.1      0.00      0.0

and for R-devel

>  summaryRprof("Rprof.out.1.9.0")$by.total[1:30,]
                  total.time total.pct self.time self.pct
"source"                3.52     100.0      0.00      0.0
"boot"                  3.42      97.2      0.00      0.0
...
"validObject"           1.52      43.2      0.04      1.1
"findClass"             1.30      36.9      0.00      0.0
".requirePackage"       1.24      35.2      0.02      0.6

So it seems most of the increased time is coming from validObject, and
in particular from calls to findClass and then .requirePackage that
occur there.  The .requirePackage code does

> methods:::.requirePackage
function (package, useNamespace = FALSE) 
{
    ...
    if (is.character(package) && package != "flexmix") 
        value <- trySilent(getNamespace(package))
    ...
}

In this case it is called with "flexmix".  If flexmix had a name space
this would be a quick operation since the name space would be found in
the internal cache (it would be better to use getNamespace here since
that returns faster if the name space is loaded). But since flexmix
does not have a name space, the load fails every time, but not until
it has gone through a fair amount of effort to (re)discover that there
is no name space.

So the methods code needs to do some more caching of information here.
If I fiddle .requirePackage to not attempt to load flexmix as a name
space then the timings are

R-devel:	2.02
1.7.1:		1.72

validObject is still twice as slow in R-devel as in 1.7.1, which
accounts for about half the remaining speed loss, but that may be due
to its doing more useful validation.  Plotting the total timings for
the common functions under 1.7.1 and the modified R-devel that doesn't
try to load a flexmix name space suggest there may be a few otehr
places in methods that could use a bit of tuning, but minor compared
to the main issue here.

Luke

On Thu, 4 Dec 2003, Liaw, Andy wrote:

> > From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
> [...]
> > A very first step of diagnosis might be to activate
> >   trace(read.dcf)
> >   trace(library)
> >   options(verbose = TRUE)
> > 
> > A step further might be to patch read.dcf such that it prints
> > info 
> >        if(getOption("verbose")) { <<print what I am trying to read>> }
> > 
> > Martin
> 
> To refresh people's memory, the problem is that the identical code took more
> than twice as long in R-1.8.x than in R-1.7.x.  I've managed to strip the
> code down to the following and still see the same problem.  It basically
> bootstraps mixture models to groups of data.  It uses boot() from the boot
> package and flexmix() with intercept only, in the flexmix package.
> 
> =============================================
> fitmix <- function(x, verbosefit=FALSE) {
>   bpunlist <- function(x) {
>     unlisted <- numeric()
>     for(i in 1:length(x at components)) {
>       unlisted <- c(unlisted, unlist(parameters(x, component=i)))
>     }
>     unlisted
>   }
>   unifit <-  flexmix(x ~ 1, k = 1, control=list(verbose=0))
>   unlist(unlist(list(G = 1, proportion = unifit at prior,
>                      param=matrix(bpunlist(unifit), nrow=2),
>                      bic=BIC(unifit), loglik=unifit at logLik)))
> }
> 
> bootmix <- function(data, i) {
>   d <- data
>   d.grps <- split.data.frame(d[i,], d[i, "id"])
>   comps <- vector("list", length=length(d.grps))
>   for (j in seq(along=comps)) {
>     thefit <- try(fitmix(d.grps[[j]][,1], verbosefit=FALSE))
>     comps[[j]] <- if (inherits(thefit, "try-error")) rep(NA, 48) else thefit
>   }
>   return(unlist(comps))
> }
> 
> ## The actual commands:
> require(flexmix)
> require(boot)
> set.seed(76421)
> x <- rnorm(3e3)
> id <- gl(10, 300)
> mydf <- data.frame(x, id)
> Rprof(filename=paste("Rprof.out", system("hostname -s", int=TRUE), sep="."))
> system.time(res <- boot(mydf, bootmix, strata=mydf[,"id"], R=5))
> Rprof(NULL)
> summaryRprof(filename=paste("Rprof.out", system("hostname -s", int=TRUE),
>                sep="."))$by.self[1:20,]
> =============================================
> 
> I ran this on a Xeon 2.4GHz running Mandrake 9.0, R compiled from source.
> 
> This took 2 seconds in R-1.7.1, with the output:
> 
>                   self.time self.pct total.time total.pct
> "FUN"                  0.18        9       0.24        12
> "names"                0.16        8       0.16         8
> "initialize"           0.10        5       0.80        40
> "apply"                0.08        4       0.30        15
> "[.data.frame"         0.06        3       0.22        11
> "inherits"             0.06        3       0.18         9
> "is.null"              0.06        3       0.06         3
> "match"                0.06        3       0.22        11
> "seq.default"          0.06        3       0.06         3
> "structure"            0.06        3       0.08         4
> "any"                  0.04        2       0.14         7
> "el"                   0.04        2       0.04         2
> "FLXfit"               0.04        2       1.18        59
> "lapply"               0.04        2       0.22        11
> "lm.wfit"              0.04        2       0.16         8
> "names<-.default"      0.04        2       0.04         2
> "slot<-"               0.04        2       0.18         9
> "<"                    0.02        1       0.02         1
> "|"                    0.02        1       0.02         1
> ":"                    0.02        1       0.02         1
> 
> ... and took over 7 seconds in R-1.8.1, with the output:
> 
>                 self.time self.pct total.time total.pct
> "paste"              0.52      6.7       1.10      14.1
> "read.dcf"           0.38      4.9       0.96      12.3
> "exists"             0.36      4.6       0.50       6.4
> "lapply"             0.36      4.6       1.36      17.4
> "names"              0.34      4.4       0.40       5.1
> "names<-"            0.32      4.1       0.44       5.6
> "inherits"           0.22      2.8       0.34       4.4
> "=="                 0.20      2.6       0.20       2.6
> ".Call"              0.20      2.6       0.20       2.6
> "seq"                0.18      2.3       0.50       6.4
> "seq.default"        0.18      2.3       0.28       3.6
> "dynGet"             0.16      2.1       0.46       5.9
> "topenv"             0.16      2.1       0.18       2.3
> "unique"             0.16      2.1       1.00      12.8
> "apply"              0.14      1.8       0.20       2.6
> "as.list"            0.14      1.8       0.18       2.3
> ".find.package"      0.14      1.8       1.88      24.1
> "sapply"             0.14      1.8       1.54      19.7
> "any"                0.12      1.5       1.68      21.5
> "initialize"         0.12      1.5       4.98      63.8
> 
> I then did the following:
> 
> > trace(read.dcf, recover)
> [1] "read.dcf"
> > res <- boot(mydf, bootmix, strata=mydf[,"id"], R=1)
> Tracing read.dcf(file.path(package.lib, package, "DESCRIPTION"), fields =
> "Namespace") on entry 
> 
> Enter a frame number, or 0 to exit   
> 1:boot(mydf, bootmix, strata = mydf[, "id"], R = 1) 
> 2:statistic(data, original, ...) 
> 3:try(fitmix(d.grps[[j]][, 1], verbosefit = FALSE)) 
> 4:fitmix(d.grps[[j]][, 1], verbosefit = FALSE) 
> 5:flexmix(x ~ 1, k = 1, control = list(verbose = 0)) 
> 6:flexmix(x ~ 1, k = 1, control = list(verbose = 0)) 
> 7:flexmix(formula = formula, data = data, k = k, cluster = cluster, model =
> list(F 
> 8:FLXglm() 
> 9:new("FLXmodel", weighted = TRUE, formula = formula, name = paste("FLXglm",
> famil 
> 10:initialize(value, ...) 
> 11:initialize(value, ...) 
> 12:getClass(Class) 
> 13:.requirePackage(package) 
> 14:trySilent(loadNamespace(package)) 
> 15:eval.parent(call) 
> 16:eval(expr, p) 
> 17:eval(expr, envir, enclos) 
> 18:try(loadNamespace(package)) 
> 19:loadNamespace(package) 
> 20:packageHasNamespace(package, package.lib) 
> 21:read.dcf(file.path(package.lib, package, "DESCRIPTION"), fields =
> "Namespace") 
> 
> I'm really out of ideas at this point.  Can anyone see the problem given the
> above, or at least tell me what to try next?
> 
> Best,
> Andy
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> 

-- 
Luke Tierney
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:      luke at stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu