[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Kevin B. Hendricks kevin.hendricks at sympatico.ca
Thu Jul 27 15:31:03 CEST 2006


Hi Developers,

I am looking for another new project to help me get more up to speed
on R and to learn something outside of R internals.  One recent R
issue I have run into is finding a fast implementation of the
equivalent of the following SAS code:

/* MDPC is an integer sort key made from two integer columns */
MDPC = (MD * 100000) + PCO;

/* sort the dataset by the key */
PROC SORT;
   BY MDPC;

/* print out count and sum for each unique sort key (subgroup) */
/* use of BY MDPC requires sorting that data set by MDPC first in SAS */
PROC UNIVARIATE NOPRINT;
    VAR MVE;
    BY MDPC;
    OUTPUT OUT=TMP0 N=XN SUM=XSUM;

This is easy to do in R, but the problem is that the data set it is
being run on has 1,742,201 lines and takes up 196,868,713 bytes to
store as character data.  The sort key easily has over 200,000 unique
values (if not twice that).

My first R attempt was simply this:

# sort the data.frame gd and the sort key
sorder <- order(MDPC)
gd  <- gd[sorder,]
MDPC <- MDPC[sorder]
attach(gd)

# find the length and sum for each unique sort key
XN <- by(MVE, MDPC, length)
XSUM <- by(MVE, MDPC, sum)
GRPS <- levels(as.factor(MDPC))

Well, the ordering and sorting were reasonably fast, but the first "by"
statement was still running 4 hours later on my machine (a dual 2.6 GHz
Opteron with 4 GB of main memory).  The same snippet of code in SAS,
running on a slower machine, takes about 5 minutes of system time.

I tried various simple R implementations of a "by_sorted" that I
thought might help:

# walk sorted array once and keep track of beginning
# and ending points for each unique sort key value in p
# and run function fcn on that sub sequence in vector v
# store the results in a vector

by_sorted <- function(v, p, fcn) {
    key <- p[[1]]
    bp <- 1
    r <- NULL
    for (i in 2:length(p)) {
       if (key != p[[i]]) {
           # close the run bp..(i-1); the parentheses matter, since
           # bp:i-1 parses as (bp:i)-1 because ":" binds tighter than "-"
           r <- c(r, fcn(v[bp:(i - 1)]))
           bp <- i
           key <- p[[i]]
       }
    }
    # the final run extends to the last element
    r <- c(r, fcn(v[bp:length(p)]))
    r
}

but they also took "forever" to run (read: I killed those attempts
after 15 minutes of CPU time).

I had exactly the same issue when trying "tapply".
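
One thing I suspect (but have not profiled) is that growing r with
c() on every new key makes that loop quadratic in the number of
groups, which with 200,000+ keys would hurt.  Preallocating the result
(an untested sketch of the same walk as above) would at least keep the
loop itself linear:

by_sorted2 <- function(v, p, fcn) {
    ngrp <- length(unique(p))        # one result slot per distinct key
    r <- numeric(ngrp)
    g <- 0                           # number of groups closed so far
    bp <- 1                          # start of the current run
    for (i in 2:length(p)) {
        if (p[[i]] != p[[bp]]) {     # key changed: close the run bp..(i-1)
            g <- g + 1
            r[g] <- fcn(v[bp:(i - 1)])
            bp <- i
        }
    }
    r[g + 1] <- fcn(v[bp:length(p)]) # the final run
    r
}

Even so, an interpreted loop over 1.7 million rows is probably still
too slow, which is why I keep coming back to a C or Fortran
implementation.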

So unless it already exists somewhere, I need a really fast
implementation of "by" for very large sorted data sets (probably
written in Fortran or C) that does the equivalent of what SAS does
with its "proc univariate by" approach, with close to the same
performance.  The code should only have to walk the array once (i.e.
be linear in the number of rows in the vector).  I have similar issues
with "merge": merging data frames that are already sorted by the same
sort key should also be fast, and it does not appear to be.
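
Just to make the linear-time idea concrete, the counts and sums for a
sorted key can already be had with a few vectorized passes over the
data (a sketch only, reusing the MDPC and MVE names from above; not a
proposed interface):

# last row index of each group in the sorted key
last <- c(which(diff(MDPC) != 0), length(MDPC))

GRPS <- MDPC[last]                    # the unique keys, in sorted order
XN   <- diff(c(0L, last))             # number of rows in each group
XSUM <- diff(c(0, cumsum(MVE)[last])) # sum of MVE within each group

Something similar should be possible for a sorted "merge", lining the
two frames up by these boundary positions instead of letting merge()
re-hash and re-order rows, but I have not sketched that part yet.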

Before I jump in and create my own "sorted large data set" versions
of "by" and "merge", I wanted to be sure they would be of interest to
others.  If they work well and are well implemented (a big if, since I
am really still just learning this - the whole point of the project!),
would something like this be of any interest for internal use in R?
Or is this too specialized?
Is there an R function implemented in C or Fortran that would make a
good "model" to follow for implementing something like this?
Would/should they be extensions of the current implementations of
"merge" and "by", with the addition of an extra sorted=TRUE parameter
(defaulting to FALSE)?

Or am I simply barking up the wrong tree here?

Thanks,

Kevin


