[R] spending most of my time in assignments?
Ross Boylan
ross at biostat.ucsf.edu
Thu Dec 26 21:45:07 CET 2013
On Thu, 2013-12-19 at 20:37 -0500, Duncan Murdoch wrote:
> On 13-12-19 6:37 PM, Ross Boylan wrote:
> > My code seems to be spending most of its time in assignment statements,
> > in some cases simple assignment of a model frame or model matrix.
> >
> > Can anyone provide any insights into what's going on, or how to speed
> > things up?
>
> You are seeing a lot of time being spent on complex assignments. For
> example, line 158 is
>
> data(sims.c1[[k]]) <- sp
>
> That makes a function call to `data<-` to do the assignment, and that
> could be slow. Since it's an S4 method there's a bunch of machinery
> involved in dispatching it; most of that would not have line number
> information, so it'll be charged to that line.
>
> I can't really suggest how to speed it up.
>
> Duncan Murdoch
Simply reexpressing the same computations without assigning to S4 slots
or dispatching using S4 cut the execution time to under 50% of what it
had been.
I attempted to recreate this with a small bit of code, but got nothing
like the effect in the original.
Here's the example; v1 is like my original code and v2 is like the
restructured code:
slow <- setClass("slow",
representation(form="formula", data="data.frame", mm="matrix"))
slowb <- setClass("slowb", representation(form="formula", data="data.frame", mm="matrix"))
slowc <- setClass("slowc", representation(form="formula", data="data.frame", mm="matrix"))
# original had only 3 classes
# fake data. Using my real data didn't seem to affect relative times much.
# real data had ~ 100 columns in model matrix
mydata <- data.frame(a=rnorm(1500), b=rnorm(1500), c=rnorm(1500),d=as.factor(rep(c("x", "y", "z"), 500)))
myformula <- ~a*d + b*d + c*d
mymm <- model.matrix(myformula, mydata)
if (!isGeneric("putData<-"))
setGeneric("putData<-", function(obj, value) standardGeneric("putData<-"))
setMethod("putData<-", c("slow", "data.frame"),
function(obj, value){
myf <- model.frame(obj at form, value)
obj at data <- myf
obj at mm <- model.matrix(obj at form, myf)
obj
}
)
setMethod("putData<-", c("slowb", "data.frame"),
function(obj, value){
myf <- model.frame(obj at form, value)
obj at data <- myf
obj at mm <- model.matrix(obj at form, myf)
obj
}
)
setMethod("putData<-", c("slowc", "data.frame"),
function(obj, value){
myf <- model.frame(obj at form, value)
obj at data <- myf
obj at mm <- model.matrix(obj at form, myf)
obj
}
)
v1 <- function(n) {
s <- list(slow(form=myformula, mm=mymm), 1:5)
for (i in 1:n) {
mydata$b <- rnorm(nrow(mydata))
putData(s[[1]]) <- mydata
}
mm <- s[[1]]@mm
}
# v2 eliminates the dispatch on putData and the assignment
# to the S4 slot
v2 <- function(n) {
s <- slow(form=myformula, mm=mymm)
for (i in 1:n) {
mydata$b <- rnorm(nrow(mydata))
myf <- model.frame(s at form, mydata)
mm <- model.matrix(s at form, myf)
}
mm
}
> system.time(r <- v1(100))
user system elapsed
0.304 0.000 0.307
> system.time(r <- v2(100))
user system elapsed
0.26 0.00 0.26
Ross Boylan
>
> >
> > For starters, is it possible that the reports are not accurate, or that
> > I am misreading them. In R 3.0.1 (running under ESS):
> > > Rprof(line.profiling=TRUE)
> > > system.time(r <- totalEffect(dodata[[1]], dodata[[2]], 1:3, 4))
> > user system elapsed
> > 21.629 0.756 22.469
> > !> Rprof(NULL)
> > > summaryRprof(lines="both")
> > $by.self
> > self.time self.pct total.time total.pct
> > box.R#158 6.74 29.56 13.06 57.28
> > simulator.multinomial.R#64 2.92 12.81 2.96 12.98
> > simulator.multinomial.R#63 2.76 12.11 2.76 12.11
> > box.R#171 2.54 11.14 5.08 22.28
> > simulator.d1.R#70 0.98 4.30 0.98 4.30
> > simulator.d1.R#71 0.98 4.30 0.98 4.30
> > densMap.R#42 0.72 3.16 0.86 3.77
> > "standardGeneric" 0.52 2.28 11.30 49.56
> > ......
> >
> > Here's some of the code, with comments at the line numbers
> > box.R:
> > sp <- merge(sexpartner, data, by="studyidx")
> > sp$y <- numFactor(sp$pEthnic) #I think y is not used but must be present
> > data(sims.c1[[k]]) <- sp ###<<<<< line 158
> > sp0 <- sp
> > sp <- sim(sims.c1[[k]], i)
> > ctable[[k]] <- update.c1(ctable[[k]], sp)
> > if (is.null(i.c1.in)) {
> > i.c1.in <- match("pEthnic", colnames(sp0))
> > i.c1.out <- match(c("studyidx", "n", "pEthnic"), colnames(sp))
> > }
> > sp0 <- merge(sp0[,-i.c1.in], sp[,i.c1.out], by=c("studyidx", "n"))
> > # d1
> > sp0 <- sp0[sp0$pIsMale == 1,]
> > # avoid lots of conversion warnings
> > sp0$pEthnic <- factor(sp0$pEthnic, levels=partRaceLevels)
> > data(sims.d1[[k]]) <- sp0 ###<<<<< line 171
> > sp <- sim(sims.d1[[k]], i)
> > dtable[[k]] <- update.d1(dtable[[k]], sp)
> > rngstate[[k]] <- .Random.seed
> > The timing seems odd since it doesn't appear there's anything to do at
> > the 2 lines except invoke data<-, but if that's slow I would expect the
> > time to go to the data<- function (in a different file) and not to the
> > call.
> >
> > In fact the other big time items are inside the data<- functions.
> > simulator.multinomial.R:
> >
> > setMethod("data<-", c("simulator.multinomial", "data.frame"),
> > function(obj, value) {
> > mf <- model.frame(obj at dataFormula, data=value)
> > mf$iCluster <- fromOrig(obj at idmap, as.character(mf$studyidx))
> > if (any(is.na(mf$iCluster)))
> > stop("New studyidx--need to draw from meta distn")
> > mm <- model.matrix(obj at modelFormula, data=mf)
> > obj at data <- mf ##<<< line 63
> > obj at mm <- mm ##<<< line 64
> > return(obj)
> > })
> >
> > The mm and data slots have type restrictions, but no other validation
> > tests.
> > setClass("simulator.multinomial",
> > representation(fit="stanfit", idmap="sIDMap",
> > modelFormula="formula",
> > categories="ANY", # could be factor or character
> > # categories should be in the order of their numeric codes in y
> > # cached results
> > coef="list",
> > data="data.frame",
> > dataFormula="formula",
> > mm="matrix"))
> > Does it matter that, e.g., a model frame is more than a vanilla data frame?
> >
> > I thought assignment, given R's lazy copying behavior, was essentially
> > resetting a pointer, and so should be fast.
> >
> > Or maybe the time is going to garbage collecting the previous contents
> > of the slots?
> >
> > Ross Boylan
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
More information about the R-help
mailing list