[R] Why are split and tapply so slow with named vectors, and why is a for loop faster than mapply
Brian.J.GREGOR at odot.state.or.us
Thu Apr 8 23:12:39 CEST 2004
First, here's the problem I'm working on so you understand the context. I
have a data frame of travel activity characteristics with 70,000+ records.
These activities are identified by unique chain numbers. (Activities are
part of trip chains.) There are 17,500 chains.
I use the chain numbers as factors to split various data fields into lists of chain characteristics, with each element of the list representing one chain. For example:
> betaHomeDist[1:3]
$"400001111"
1316 2319 2317 1364 1316
0.000000 14.930820 24.431210 6.174959 0.000000
$"400001211"
1316 2319 2319 1364 1316
0.000000 14.930820 14.930820 6.174959 0.000000
$"400001212"
1316 1364 2324 1364 1316
0.000000 6.174959 14.392375 6.174959 0.000000
Each element of the list is a named vector, and each vector element is named with the zone in which the activity occurred. I use these names in subsequent computations.
What I've found, however, is that it is not easy (or I have not found the easy way) to split a named vector into a list that retains the vector names. For example, splitting an unnamed vector (70,000+ elements) based on the chain numbers takes very little time:
> system.time(actTimeList <- split(actTime, chainId))
[1] 0.16 0.00 0.15 NA NA
But if the vector is named, R will work for minutes and still not complete
the job:
> names(actTime) <- zoneNames
> system.time(actTimeList <- split(actTime, chainId))
Timing stopped at: 83.22 0.12 84.49 NA NA
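In case anyone wants to reproduce this, here is a minimal sketch with synthetic data of roughly the same size (the real actTime, zoneNames and chainId come from my data frame, so the values below are made up):
## Synthetic stand-ins of roughly the same size as my real data
nAct    <- 70000
nChains <- 17500
chainId   <- factor(sample(nChains, nAct, replace = TRUE))
actTime   <- runif(nAct, 0, 30)
zoneNames <- as.character(sample(1000:3000, nAct, replace = TRUE))
system.time(split(actTime, chainId))   # unnamed vector: quick for me
names(actTime) <- zoneNames
system.time(split(actTime, chainId))   # named vector: the slow case above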
The same thing happens when using tapply with a named vector, for example:
tapply(actTime, chainId, function(x) x)
Using the following function with a for loop accomplishes the job in a few
seconds for all 70,000+ records:
> splitWithNames <- function(dataVector, nameVector, factorVector){
+ dataList <- split(dataVector, factorVector)
+ nameList <- split(nameVector, factorVector)
+ listLength <- length(dataList)
+ namedDataList <- list(NULL)
+ for(i in 1:listLength){
+ x <- dataList[[i]]
+ names(x) <- nameList[[i]]
+ namedDataList[[i]] <- x
+ }
+ namedDataList
+ }
> system.time(actTimeList <- splitWithNames(actTime, zoneNames, chainId))
[1] 8.04 0.00 9.03 NA NA
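For what it's worth, here is a slightly tidied variant of splitWithNames (splitWithNames2 is just a sketch, only lightly tested): it pre-allocates the result list and carries over the chain numbers as list names, which my version above drops.
splitWithNames2 <- function(dataVector, nameVector, factorVector) {
    dataList <- split(dataVector, factorVector)
    nameList <- split(nameVector, factorVector)
    ## Pre-allocate the result instead of growing it inside the loop
    namedDataList <- vector("list", length(dataList))
    ## Keep the chain numbers so elements can still be looked up by chain
    names(namedDataList) <- names(dataList)
    for (i in seq(along = dataList)) {
        x <- dataList[[i]]
        names(x) <- nameList[[i]]
        namedDataList[[i]] <- x
    }
    namedDataList
}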
However, if I rewrite the function to use mapply instead of a for loop, it again takes a long (undetermined) amount of time to complete. Here are the results for just 5000 and 10000 records. You can see that there is a scaling issue:
> testfun <- function(dataVector, nameVector, factorVector){
+ dataList <- split(dataVector, factorVector)
+ nameList <- split(nameVector, factorVector)
+ nameFun <- function(x, xNames){
+ names(x) <- xNames
+ x
+ }
+ mapply(nameFun, dataList, nameList, SIMPLIFY=TRUE)
+ }
> system.time(actTimeList <- testfun(actTime[1:5000], zoneNames[1:5000],
+ chainId[1:5000]))
[1] 2.99 0.00 2.98 NA NA
> system.time(actTimeList <- testfun(actTime[1:10000], zoneNames[1:10000],
+ chainId[1:10000]))
[1] 10.64 0.00 10.64 NA NA
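For reference, this is the head-to-head comparison I would run on the synthetic data from the sketch above (testfun2 is illustrative only; it differs from testfun just in using SIMPLIFY=FALSE, since all I want back is a list):
testfun2 <- function(dataVector, nameVector, factorVector) {
    dataList <- split(dataVector, factorVector)
    nameList <- split(nameVector, factorVector)
    nameFun <- function(x, xNames) {
        names(x) <- xNames
        x
    }
    ## SIMPLIFY = FALSE: return the list as-is, no attempt to simplify it
    mapply(nameFun, dataList, nameList, SIMPLIFY = FALSE)
}
## Strip the names again so split() itself is fast in both versions
names(actTime) <- NULL
n <- 10000
system.time(splitWithNames(actTime[1:n], zoneNames[1:n], chainId[1:n]))
system.time(testfun2(actTime[1:n], zoneNames[1:n], chainId[1:n]))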
My problem is solved for now with the home-brew splitWithNames function, but I'm curious: why do named vectors slow down split and tapply so much, and why is a function that uses mapply so much slower than one that uses a for loop?
My computer is an 800+ MHz Pentium III with 512 MB of memory. The operating system is Windows NT 4.0. My R version is 1.8.1.
Thank you.
Brian Gregor, P.E.
Transportation Planning Analysis Unit
Oregon Department of Transportation
Brian.J.GREGOR at odot.state.or.us
(503) 986-4120