[R] Why are Split and Tapply so slow with named vectors, why is a for loop faster than mapply

Fri Apr 9 00:09:02 CEST 2004

Brian.J.GREGOR at odot.state.or.us writes:

> What I've found, however, is that it is not easy (or I have not found the
> easy way) to split a named vector into a list that retains the vector names.
> For example, splitting an unnamed vector (70,000+) based on the chain
> numbers takes very little time:
> > system.time(actTimeList <- split(actTime, chainId))
> [1] 0.16 0.00 0.15   NA   NA
> 
> But if the vector is named, R will work for minutes and still not complete
> the job:
> > names(actTime) <- zoneNames
> > system.time(actTimeList <- split(actTime, chainId))
> Timing stopped at: 83.22 0.12 84.49 NA NA
> 
> The same thing happens with using tapply with a named vector such as:
> tapply(actTime, chainId, function(x) x)
> 
> Using the following function with a for loop accomplishes the job in a few
> seconds for all 70,000+ records: 
> > splitWithNames <- function(dataVector, nameVector, factorVector){
> +     dataList <- split(dataVector, factorVector)
> +     nameList <- split(nameVector, factorVector)
> +     listLength <- length(dataList)
> +     namedDataList <- list(NULL)
> +     for(i in 1:listLength){
> +         x <- dataList[[i]]
> +         names(x) <- nameList[[i]]
> +         namedDataList[[i]] <- x
> +         }
> +     namedDataList
> +     }
> > system.time(actTimeList <- splitWithNames(actTime, zoneNames, chainId))
> [1] 8.04 0.00 9.03   NA   NA
> 
> However if I rewrite the function to use mapply instead of a for loop, it
> again takes a long (undetermined) amount of time to complete. Here are the
> results for just 5000 and 10000 records. You can see that there is a
> scaling issue:
> > testfun <- function(dataVector, nameVector, factorVector){
> +     dataList <- split(dataVector, factorVector)
> +     nameList <- split(nameVector, factorVector)
> +     nameFun <- function(x, xNames){
> +         names(x) <- xNames
> +         x
> +         }
> +     mapply(nameFun, dataList, nameList, SIMPLIFY=TRUE)
> +     }
> > system.time(actTimeList <- testfun(actTime[1:5000], zoneNames[1:5000],
> chainId[1:5000]))
> [1] 2.99 0.00 2.98   NA   NA
> > system.time(actTimeList <- testfun(actTime[1:10000], zoneNames[1:10000],
> chainId[1:10000]))
> [1] 10.64  0.00 10.64    NA    NA
> 
> My problem is solved for now with the home-brew splitWithNames function, but
> I'm curious about why named vectors slow down split and tapply so much and
> why a function using mapply is so much slower than a function that uses a
> for loop?

If you look inside split.default, you'll see that it only uses fast
internal code in simple cases:

    if (is.null(attr(x, "class")) && is.null(names(x)))
        return(.Internal(split(x, f)))

in the other cases, we use

    for (k in lf) y[[k]] <- x[f %in% k]

and if lf is large, we get a large number of calls to %in%, each of
which scans all of f, so the cost grows roughly as length(x) times the
number of groups. This wasn't really designed for that case, but I
suppose we could be smarter about it.
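
One way to sidestep the repeated %in% scans is to split the positions
rather than the values: seq_along(x) is an unnamed integer vector, so it
takes the fast internal path, and indexing x by each group of positions
carries the names along. This is only a sketch; splitKeepNames is a
hypothetical helper name, not something in base R, and the toy data
stands in for actTime and chainId from the question.

```r
# Hypothetical helper (not part of base R): split a named vector by
# splitting its positions, then indexing the original vector.
splitKeepNames <- function(x, f) {
    idx <- split(seq_along(x), f)    # unnamed integers: fast internal split
    lapply(idx, function(i) x[i])    # x[i] keeps names(x)[i]
}

# Toy stand-ins for the poster's actTime / chainId.
actTime <- c(a = 10, b = 20, c = 30, d = 40)
chainId <- c(1, 2, 1, 2)

res <- splitKeepNames(actTime, chainId)
str(res)
```

Unlike the for-loop version in the question, the result also keeps the
group labels as the names of the list.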

Wouldn't know about mapply, but are you sure you want SIMPLIFY=TRUE in
there???
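
To illustrate the SIMPLIFY point: with SIMPLIFY=TRUE, mapply tries to
collapse its result into a vector or matrix whenever every group
happens to have the same length, which is probably not what you want
when the goal is a list of named vectors. A small sketch, using made-up
data in place of the split lists from testfun:

```r
# nameFun as in the question: attach names to one group of values.
nameFun <- function(x, xNames) {
    names(x) <- xNames
    x
}

# Made-up stand-ins for dataList / nameList, two groups of equal length.
dataList <- list(`1` = c(10, 30), `2` = c(20, 40))
nameList <- list(`1` = c("a", "c"), `2` = c("b", "d"))

asMatrix <- mapply(nameFun, dataList, nameList, SIMPLIFY = TRUE)
asList   <- mapply(nameFun, dataList, nameList, SIMPLIFY = FALSE)

is.matrix(asMatrix)   # equal-length groups get collapsed into a matrix
str(asList)           # always a list, one named vector per group
```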

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907