[R] Why are split and tapply so slow with named vectors, and why is a for loop faster than mapply
Brian.J.GREGOR at odot.state.or.us
Thu Apr 8 23:12:39 CEST 2004
First, here's the problem I'm working on so you understand the context. I
have a data frame of travel activity characteristics with 70,000+ records.
These activities are identified by unique chain numbers. (Activities are
part of trip chains.) There are 17,500 chains.
I use the chain numbers as factors to split various data fields into lists of chain characteristics, with each element of the list representing one chain. For example:
> betaHomeDist[1:3]
$"400001111"
1316 2319 2317 1364 1316
0.000000 14.930820 24.431210 6.174959 0.000000
$"400001211"
1316 2319 2319 1364 1316
0.000000 14.930820 14.930820 6.174959 0.000000
$"400001212"
1316 1364 2324 1364 1316
0.000000 6.174959 14.392375 6.174959 0.000000
Each element of the list is a named vector, and each vector element is named with the zone in which the activity occurred. I use these names in subsequent computations.
What I've found, however, is that it is not easy (or I have not found the easy way) to split a named vector into a list that retains the vector names. For example, splitting an unnamed vector (70,000+ elements) based on the chain numbers takes very little time:
> system.time(actTimeList <- split(actTime, chainId))
[1] 0.16 0.00 0.15 NA NA
But if the vector is named, R will work for minutes and still not complete
the job:
> names(actTime) <- zoneNames
> system.time(actTimeList <- split(actTime, chainId))
Timing stopped at: 83.22 0.12 84.49 NA NA
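In case anyone wants to reproduce this, here is a minimal sketch with synthetic data of roughly the same size (the real actTime, zoneNames and chainId come from my data frame, so the values below are made up):
## Synthetic stand-ins of roughly the same size as my real data
nAct    <- 70000
nChains <- 17500
chainId   <- factor(sample(nChains, nAct, replace = TRUE))
actTime   <- runif(nAct, 0, 30)
zoneNames <- as.character(sample(1000:3000, nAct, replace = TRUE))
system.time(split(actTime, chainId))   # unnamed vector: quick for me
names(actTime) <- zoneNames
system.time(split(actTime, chainId))   # named vector: the slow case above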
The same thing happens when using tapply with a named vector, for example:
tapply(actTime, chainId, function(x) x)
Using the following function with a for loop accomplishes the job in a few
seconds for all 70,000+ records:
> splitWithNames <- function(dataVector, nameVector, factorVector){
+ dataList <- split(dataVector, factorVector)
+ nameList <- split(nameVector, factorVector)
+ listLength <- length(dataList)
+ namedDataList <- list(NULL)
+ for(i in 1:listLength){
+ x <- dataList[[i]]
+ names(x) <- nameList[[i]]
+ namedDataList[[i]] <- x
+ }
+ namedDataList
+ }
> system.time(actTimeList <- splitWithNames(actTime, zoneNames, chainId))
[1] 8.04 0.00 9.03 NA NA
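For what it's worth, here is a slightly tidied variant of splitWithNames (splitWithNames2 is just a sketch, only lightly tested): it pre-allocates the result list and carries over the chain numbers as list names, which my version above drops.
splitWithNames2 <- function(dataVector, nameVector, factorVector) {
    dataList <- split(dataVector, factorVector)
    nameList <- split(nameVector, factorVector)
    ## Pre-allocate the result instead of growing it inside the loop
    namedDataList <- vector("list", length(dataList))
    ## Keep the chain numbers so elements can still be looked up by chain
    names(namedDataList) <- names(dataList)
    for (i in seq(along = dataList)) {
        x <- dataList[[i]]
        names(x) <- nameList[[i]]
        namedDataList[[i]] <- x
    }
    namedDataList
}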
However, if I rewrite the function to use mapply instead of a for loop, it again takes a long (undetermined) amount of time to complete. Here are the results for just 5000 and 10000 records. You can see that there is a scaling issue:
> testfun <- function(dataVector, nameVector, factorVector){
+ dataList <- split(dataVector, factorVector)
+ nameList <- split(nameVector, factorVector)
+ nameFun <- function(x, xNames){
+ names(x) <- xNames
+ x
+ }
+ mapply(nameFun, dataList, nameList, SIMPLIFY=TRUE)
+ }
> system.time(actTimeList <- testfun(actTime[1:5000], zoneNames[1:5000],
+ chainId[1:5000]))
[1] 2.99 0.00 2.98 NA NA
> system.time(actTimeList <- testfun(actTime[1:10000], zoneNames[1:10000],
+ chainId[1:10000]))
[1] 10.64 0.00 10.64 NA NA
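For reference, this is the head-to-head comparison I would run on the synthetic data from the sketch above (testfun2 is illustrative only; it differs from testfun just in using SIMPLIFY=FALSE, since all I want back is a list):
testfun2 <- function(dataVector, nameVector, factorVector) {
    dataList <- split(dataVector, factorVector)
    nameList <- split(nameVector, factorVector)
    nameFun <- function(x, xNames) {
        names(x) <- xNames
        x
    }
    ## SIMPLIFY = FALSE: return the list as-is, no attempt to simplify it
    mapply(nameFun, dataList, nameList, SIMPLIFY = FALSE)
}
## Strip the names again so split() itself is fast in both versions
names(actTime) <- NULL
n <- 10000
system.time(splitWithNames(actTime[1:n], zoneNames[1:n], chainId[1:n]))
system.time(testfun2(actTime[1:n], zoneNames[1:n], chainId[1:n]))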
My problem is solved for now with the home-brew splitWithNames function, but I'm curious: why do named vectors slow down split and tapply so much, and why is a function that uses mapply so much slower than one that uses a for loop?
My computer is an 800+ MHz Pentium III with 512 MB of memory. The operating system is Windows NT 4.0. My R version is 1.8.1.
Thank you.
Brian Gregor, P.E.
Transportation Planning Analysis Unit
Oregon Department of Transportation
Brian.J.GREGOR at odot.state.or.us
(503) 986-4120