[R] Why are Split and Tapply so slow with named vectors, why is a for loop faster than mapply
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Fri Apr 9 00:09:02 CEST 2004
Brian.J.GREGOR at odot.state.or.us writes:
> What I've found, however, is that it is not easy (or I have not found the
> easy way) to split a named vector into a list that retains the vector names.
> For example, splitting an unnamed vector (70,000+) based on the chain
> numbers takes very little time:
> > system.time(actTimeList <- split(actTime, chainId))
> [1] 0.16 0.00 0.15 NA NA
>
> But if the vector is named, R will work for minutes and still not complete
> the job:
> > names(actTime) <- zoneNames
> > system.time(actTimeList <- split(actTime, chainId))
> Timing stopped at: 83.22 0.12 84.49 NA NA
>
> The same thing happens with using tapply with a named vector such as:
> tapply(actTime, chainId, function(x) x)
>
> Using the following function with a for loop accomplishes the job in a few
> seconds for all 70,000+ records:
> > splitWithNames <- function(dataVector, nameVector, factorVector){
> +     dataList <- split(dataVector, factorVector)
> +     nameList <- split(nameVector, factorVector)
> +     listLength <- length(dataList)
> +     namedDataList <- list(NULL)
> +     for(i in 1:listLength){
> +         x <- dataList[[i]]
> +         names(x) <- nameList[[i]]
> +         namedDataList[[i]] <- x
> +     }
> +     namedDataList
> + }
> > system.time(actTimeList <- splitWithNames(actTime, zoneNames, chainId))
> [1] 8.04 0.00 9.03 NA NA
>
> However if I rewrite the function to use mapply instead of a for loop, it
> again takes a long (undetermined) amount of time to complete. Here are the
> results for just 5000 and 10000 records. You can see that there is a
> scaling issue:
> > testfun <- function(dataVector, nameVector, factorVector){
> +     dataList <- split(dataVector, factorVector)
> +     nameList <- split(nameVector, factorVector)
> +     nameFun <- function(x, xNames){
> +         names(x) <- xNames
> +         x
> +     }
> +     mapply(nameFun, dataList, nameList, SIMPLIFY=TRUE)
> + }
> > system.time(actTimeList <- testfun(actTime[1:5000], zoneNames[1:5000],
> chainId[1:5000]))
> [1] 2.99 0.00 2.98 NA NA
> > system.time(actTimeList <- testfun(actTime[1:10000], zoneNames[1:10000],
> chainId[1:10000]))
> [1] 10.64 0.00 10.64 NA NA
>
> My problem is solved for now with the home-brew splitWithNames function, but
> I'm curious about why named vectors slow down split and tapply so much and
> why a function using mapply is so much slower than a function that uses a
> for loop?

If you look inside split.default, you'll see that it only uses fast
internal code in simple cases:

    if (is.null(attr(x, "class")) && is.null(names(x)))
        return(.Internal(split(x, f)))

in the other cases, we use

    for (k in lf) y[[k]] <- x[f %in% k]

and if lf is large, we get a large number of calls to %in%. This
wasn't really designed for that case, but I suppose we could be
smarter about it.
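
One possible way to be smarter, as a rough sketch only: split the
positions instead of the named vector itself. The positions are a plain
unnamed integer vector, so split() can take the fast internal branch
shown above, and subsetting by position keeps the names. The helper name
splitKeepNames and the toy data are made up for illustration; this is
not what split.default does.

```r
# Hypothetical helper (not part of base R): split a named vector
# without the per-level %in% loop.
splitKeepNames <- function(x, f) {
  # seq_along(x) is unnamed and classless, so split() groups it quickly
  idx <- split(seq_along(x), f)
  # subsetting by position carries the names of x along
  lapply(idx, function(i) x[i])
}

# Small made-up data standing in for actTime and chainId
actTime <- c(a = 10, b = 20, c = 30, d = 40)
chainId <- c(1, 2, 1, 2)
res <- splitKeepNames(actTime, chainId)
```

Here res is a list with one element per chain, and each element still
carries its zone names. I have not timed this on the 70,000+ record
case, so treat any speed-up as something to check with system.time().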

I wouldn't know about mapply, but are you sure you want SIMPLIFY=TRUE
in there?
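
To show what the SIMPLIFY question is about, here is a small sketch with
made-up data (the list contents are assumptions, only nameFun is taken
from the quoted testfun). With SIMPLIFY=FALSE mapply always returns a
list; with SIMPLIFY=TRUE it tries to collapse the result, so groups of
equal length come back as a matrix rather than a list of named vectors.
This only illustrates the shape of the result and is not a claim about
where the time goes.

```r
# nameFun as in the quoted testfun
nameFun <- function(x, xNames) {
  names(x) <- xNames
  x
}

# Made-up groups of equal length, standing in for the split() results
dataList <- list(g1 = c(1, 2), g2 = c(3, 4))
nameList <- list(g1 = c("a", "b"), g2 = c("c", "d"))

# Always a list of named vectors
asList <- mapply(nameFun, dataList, nameList, SIMPLIFY = FALSE)

# Collapsed to a 2 x 2 matrix, because both results have length 2
simplified <- mapply(nameFun, dataList, nameList, SIMPLIFY = TRUE)
```

In the simplified result the per-group names of the second group are
gone, which is probably not what is wanted when the goal is a list of
named vectors.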
--
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907