# [Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

Kevin B. Hendricks kevin.hendricks at sympatico.ca
Thu Jul 27 15:31:03 CEST 2006

```Hi Developers,

I am looking for another new project to help me get more up to speed
on R and to learn something outside of R internals.   One recent R
issue I have run into is finding a fast implementation of the
equivalent of the following SAS code:

/* MDPC is an integer sort key made from two integer columns */
MDPC = (MD * 100000) + PCO;

/* sort the dataset by the key */
PROC SORT;
BY MDPC;

/* print out count and sum for each unique sort key (subgroup) */
/* use of BY MDPC requires sorting that data set by MDPC first in SAS */
PROC UNIVARIATE NOPRINT;
VAR MVE;
BY MDPC;
OUTPUT OUT=TMP0 N=XN SUM=XSUM;

Easy to do in R, but the problem is that the data set this is being run
on has 1,742,201 lines and takes up 196,868,713 bytes to store as
character data.  The sort key easily has over 200,000 unique values
(if not twice that).
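For reference, the composite key from the SAS step would be built the
same way in R (this is just a sketch; it assumes MD and PCO are integer
columns of the data frame gd):

# same composite key as the SAS MDPC = (MD * 100000) + PCO step
MDPC <- gd$MD * 100000 + gd$PCO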

My first R attempt was a simple

# sort the data.frame gd and the sort key
sorder <- order(MDPC)
gd  <- gd[sorder,]
MDPC <- MDPC[sorder]
attach(gd)

# find the length and sum for each unique sort key
XN <- by(MVE, MDPC, length)
XSUM <- by(MVE, MDPC, sum)
GRPS <- levels(as.factor(MDPC))

Well, the ordering and sorting were reasonably fast, but the first "by"
statement was still running 4 hours later on my machine (a dual 2.6
GHz Opteron with 4 GB of main memory).  The same snippet of code in
SAS running on a slower machine takes about 5 minutes of system time.
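For what it's worth, the same per-group counts and sums can also be
sketched with tabulate() and rowsum(), both of which are implemented in
C and make a single pass over the data (only a sketch; it assumes MDPC
can be treated as a factor):

# counts and sums per unique key, without an R-level loop
f <- factor(MDPC)
XN <- tabulate(f)
XSUM <- rowsum(MVE, f)
GRPS <- levels(f)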

I tried various simple R implementations of a "by_sorted" that I
thought might help:

# walk sorted array once and keep track of beginning
# and ending points for each unique sort key value in p
# and run function fcn on that sub sequence in vector v
# store the results in a vector

by_sorted <- function(v, p, fcn) {
  key <- p[[1]]
  bp <- 1
  r <- NULL
  for (i in 2:length(p)) {
    if (key != p[[i]]) {
      # note the parentheses: bp:i-1 would mean (bp:i)-1
      r <- c(r, fcn(v[bp:(i - 1)]))
      bp <- i
      key <- p[[i]]
    }
  }
  # don't forget the final group, and return the result
  c(r, fcn(v[bp:length(p)]))
}

but they took "forever" to run also (read: I killed those attempts
after 15 minutes of CPU time).

I ran into the same issue when trying "tapply".
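In principle the single-pass idea above can also be vectorized so that
the loop runs in C rather than in interpreted R.  Here is a sketch,
assuming the key vector p is already sorted and the function is sum:

# find the last index of each run of equal keys, then take
# differences of the cumulative sum at those boundaries
group_sums_sorted <- function(v, p) {
  ends <- c(which(p[-1] != p[-length(p)]), length(p))
  cs <- cumsum(v)[ends]
  c(cs[1], diff(cs))
}
# the group counts fall out of the same boundaries: diff(c(0, ends))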

So unless it already exists someplace, I need a really fast
implementation of "by" for very large sorted data sets (probably
written in Fortran or C) that will do the equivalent of what SAS does
with its "proc univariate by" approach, with close to the same
performance.  The code should only have to walk the array once (i.e.
be linear in time in the number of rows of the vector).  I have a
similar issue with "merge": merging data frames already sorted on the
same sort key should also be fast, and does not appear to be.
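For the merge case, when both frames are already sorted on the same
unique key, the join can at least be sketched with match(), which is
C-level (here a and b are hypothetical data frames sharing a unique
MDPC column):

# for each key in a, find the row of b carrying the same key
idx <- match(a$MDPC, b$MDPC)
merged <- cbind(a, b[idx, setdiff(names(b), "MDPC"), drop = FALSE])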

Before I jump into this and create my own "sorted large data set"
versions of "by" and "merge", I wanted to be sure it would be of
interest to others.  If they work well and are well implemented (a
big if, since I am really still just learning this - the whole point
of the project!), would something like this be of any interest for
internal use in R?  Or is this too specialized?
Is there an R function implemented in C or Fortran that would make a
good "model" to follow for implementing something like this?
Would/should they be extensions of the current implementations of
"merge" and "by", with the addition of an extra sorted=TRUE parameter
(defaulting to FALSE)?

Or am I simply barking up the wrong tree here?

Thanks,

Kevin

```