[Rd] reshape scaling with large numbers of times/rows

Thu Aug 24 08:06:03 CEST 2006

Hello,

I'm mailing r-devel because I think the performance problem I'm having
is best solved by re-writing reshape() in C.  I've been reading the
"writing R extensions" documentation and I have some questions about
they best way to write the C bits, but first let me describe my problem:

I'm trying to reshape a long data frame with ~70 subjects measured at
~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
a wide data frame with one column per locus ("time").  On my data
(~300,000 rows for chromosome 1) it takes about twenty minutes on a
3.4GHz P4.  Here's an R session that demonstrates it (this is pretty
close to how my data actually looks):

> version
               _
platform       i686-redhat-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)
> test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> system.time(testWide <- reshape(test, v.names=c("genotype"),
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1107.506  120.755 1267.568    0.000    0.000

I believe that this can be done a lot faster, and I think the problem
comes from this loop in reshape.R (in reshapeWide):

        for(i in seq(length = length(times))) {
            thistime <- data[data[,timevar] %in% times[i],]
            rval[,varying[,i]] <-
thistime[match(rval[,idvar],thistime[,idvar]),
v.names]
        }

I don't really understand everything going on under the hood here, but I
believe the complexity of this loop is something like
O(length(times)*(nrow(data)+(nrow(rval)*length(varying))).  The profile
shows the lion's share (90%) of the time being spent in [.data.frame.

What I'd like to do is write a C loop to go through the source (long)
data frame once (rather than once per time), and put the values into the
destination rows/columns using hash tables to look up the right
row/column.

Assuming the hashing is constant-time bounded, then the reshape becomes
O(nrow(data)*length(varying)).

I'd like to use the abitrary-R-vector hashing functionality from
unique.c, but I'm not sure how to do it.  I think that functionality is
useful to other C code, but the functions that I'm interested in are not
part of the R api (they're defined with "static").  Assuming copy/paste
is out, I can see two options: 1. to remove "static" from the
declarations of some of the functions, and add prototypes for those
functions to a new src/library/stats/src/reshape.c file (or to one of
the header files in src/include), or 2. to add C functions to do the
reshaping to src/main/unique.c and call those from
src/library/stats/R/reshape.R.

This is all assuming that it makes sense to include this in mainline
R--obviously I think it's worthwhile, and I'm surprised other people
aren't complaining more.  I would be happy to write/rewrite until y'all
are happy with how it's done, of course.

I've written a proof-of-concept by copying and pasting the hashing
functions, which (on the above data frame) runs 20 times faster than the
R version of reshape.  It still needs some debugging, to put it mildly
(doesn't work properly on reals), but the basic idea appears to work.

The change I made to the reshape R function looks like this:
=====================
         for(i in seq(length = length(times))) {
-            thistime <- data[data[,timevar] %in% times[i],]
-            rval[,varying[,i]] <-
thistime[match(rval[,idvar],thistime[,idvar]),
-                                           v.names]
+          for (j in seq(length(v.names))) {
+            rval[,varying[j,i]] <-
rep(vector(mode=typeof(data[[v.names[j]]]), 0),
+                                       length.out=nrow(rval))
+          }
         }

+        colMap <- match(varying, names(rval))
+        .Call("do_reshape_wide",
+              data[[idvar]], data[[timevar]], data[v.names],
+              rval, colMap,
+              v.names, times, rval[[idvar]])
+
         if (!is.null(new.row.names))
             row.names(rval) <- new.row.names
=====================

This part:
rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval))
is to initialize the output with appropriately-typed vectors full of
NAs; if there's a better/right way to do this please let me know.

The do_reshape_wide C function looks like this:
=====================
SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData, 
                     SEXP wideFrame, SEXP colMap,
                     SEXP vnames, SEXP times, SEXP wideIds) {
    HashData idHash, timeHash;
    int longRows, numVarying;
    int rowNum, varying;
    int wideRow, curTime;
    SEXP wideCol, longCol;

    HashTableSetup(wideIds, &idHash);
    PROTECT(idHash.HashTable);
    DoHashing(wideIds, &idHash);

    HashTableSetup(times, &timeHash);
    PROTECT(timeHash.HashTable);
    DoHashing(times, &timeHash);

    longRows = length(longIds);
    numVarying = length(vnames);

    for (rowNum = 0; rowNum < longRows; rowNum++) {
        /* Lookup returns 1-based answers */
        wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1;
        curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1;

        for (varying = 0; varying < numVarying; varying++) {
            /* colMap is 1-based */
            wideCol = VECTOR_ELT(wideFrame, 
                                 INTEGER(colMap)[(numVarying * curTime)
                                                 + varying] - 1);
            longCol = VECTOR_ELT(longData, varying);

            SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol,
rowNum));
        }
    }
    UNPROTECT(2);
}
=====================

None examples I recall from "Writing R extensions" had void C function
return types; is that a rule?

I've put the code up here:
http://arctur.us/r/creshape.tar.gz

I've never tried to make a package before, but you should be able to
just do a R CMD INSTALL on it, if you want to try it out.  Here's an R
session using that code:
> require(creshape)
Loading required package: creshape
[1] TRUE
> test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> system.time(testWide <- creshape(test, v.names=c("genotype"),
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 60.756  1.540 62.598  0.000  0.000
> system.time(testWide <- reshape(test, v.names=c("genotype"),
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1278.231   78.389 1387.739    0.000    0.000

Any comments are appreciated,
Mitch