[Rd] reshape scaling with large numbers of times/rows

Thu Aug 24 09:16:02 CEST 2006

If for each combination of X and Y there is at most one Z and Z is numeric
(it could be made so in your example) then you could use xtabs which is
faster:

> m <- n <- 10
> DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = 10*(1:(n*m)))
> system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
[1] 0.33 0.00 0.33   NA   NA
> system.time(w2 <- xtabs(Z ~ Y + X, DF))
[1] 0.01 0.00 0.01   NA   NA

On 8/24/06, Mitch Skinner <lists at arctur.us> wrote:
> Hello,
>
> I'm mailing r-devel because I think the performance problem I'm having
> is best solved by re-writing reshape() in C.  I've been reading the
> "writing R extensions" documentation and I have some questions about
> they best way to write the C bits, but first let me describe my problem:
>
> I'm trying to reshape a long data frame with ~70 subjects measured at
> ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
> a wide data frame with one column per locus ("time").  On my data
> (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> 3.4GHz P4.  Here's an R session that demonstrates it (this is pretty
> close to how my data actually looks):
>
> > version
>               _
> platform       i686-redhat-linux-gnu
> arch           i686
> os             linux-gnu
> system         i686, linux-gnu
> status
> major          2
> minor          3.1
> year           2006
> month          06
> day            01
> svn rev        38247
> language       R
> version.string Version 2.3.1 (2006-06-01)
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1107.506  120.755 1267.568    0.000    0.000
>
> I believe that this can be done a lot faster, and I think the problem
> comes from this loop in reshape.R (in reshapeWide):
>
>        for(i in seq(length = length(times))) {
>            thistime <- data[data[,timevar] %in% times[i],]
>            rval[,varying[,i]] <-
> thistime[match(rval[,idvar],thistime[,idvar]),
> v.names]
>        }
>
> I don't really understand everything going on under the hood here, but I
> believe the complexity of this loop is something like
> O(length(times)*(nrow(data)+(nrow(rval)*length(varying))).  The profile
> shows the lion's share (90%) of the time being spent in [.data.frame.
>
> What I'd like to do is write a C loop to go through the source (long)
> data frame once (rather than once per time), and put the values into the
> destination rows/columns using hash tables to look up the right
> row/column.
>
> Assuming the hashing is constant-time bounded, then the reshape becomes
> O(nrow(data)*length(varying)).
>
> I'd like to use the abitrary-R-vector hashing functionality from
> unique.c, but I'm not sure how to do it.  I think that functionality is
> useful to other C code, but the functions that I'm interested in are not
> part of the R api (they're defined with "static").  Assuming copy/paste
> is out, I can see two options: 1. to remove "static" from the
> declarations of some of the functions, and add prototypes for those
> functions to a new src/library/stats/src/reshape.c file (or to one of
> the header files in src/include), or 2. to add C functions to do the
> reshaping to src/main/unique.c and call those from
> src/library/stats/R/reshape.R.
>
> This is all assuming that it makes sense to include this in mainline
> R--obviously I think it's worthwhile, and I'm surprised other people
> aren't complaining more.  I would be happy to write/rewrite until y'all
> are happy with how it's done, of course.
>
> I've written a proof-of-concept by copying and pasting the hashing
> functions, which (on the above data frame) runs 20 times faster than the
> R version of reshape.  It still needs some debugging, to put it mildly
> (doesn't work properly on reals), but the basic idea appears to work.
>
> The change I made to the reshape R function looks like this:
> =====================
>         for(i in seq(length = length(times))) {
> -            thistime <- data[data[,timevar] %in% times[i],]
> -            rval[,varying[,i]] <-
> thistime[match(rval[,idvar],thistime[,idvar]),
> -                                           v.names]
> +          for (j in seq(length(v.names))) {
> +            rval[,varying[j,i]] <-
> rep(vector(mode=typeof(data[[v.names[j]]]), 0),
> +                                       length.out=nrow(rval))
> +          }
>         }
>
> +        colMap <- match(varying, names(rval))
> +        .Call("do_reshape_wide",
> +              data[[idvar]], data[[timevar]], data[v.names],
> +              rval, colMap,
> +              v.names, times, rval[[idvar]])
> +
>         if (!is.null(new.row.names))
>             row.names(rval) <- new.row.names
> =====================
>
> This part:
> rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval))
> is to initialize the output with appropriately-typed vectors full of
> NAs; if there's a better/right way to do this please let me know.
>
> The do_reshape_wide C function looks like this:
> =====================
> SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData,
>                     SEXP wideFrame, SEXP colMap,
>                     SEXP vnames, SEXP times, SEXP wideIds) {
>    HashData idHash, timeHash;
>    int longRows, numVarying;
>    int rowNum, varying;
>    int wideRow, curTime;
>    SEXP wideCol, longCol;
>
>    HashTableSetup(wideIds, &idHash);
>    PROTECT(idHash.HashTable);
>    DoHashing(wideIds, &idHash);
>
>    HashTableSetup(times, &timeHash);
>    PROTECT(timeHash.HashTable);
>    DoHashing(times, &timeHash);
>
>    longRows = length(longIds);
>    numVarying = length(vnames);
>
>    for (rowNum = 0; rowNum < longRows; rowNum++) {
>        /* Lookup returns 1-based answers */
>        wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1;
>        curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1;
>
>        for (varying = 0; varying < numVarying; varying++) {
>            /* colMap is 1-based */
>            wideCol = VECTOR_ELT(wideFrame,
>                                 INTEGER(colMap)[(numVarying * curTime)
>                                                 + varying] - 1);
>            longCol = VECTOR_ELT(longData, varying);
>
>            SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol,
> rowNum));
>        }
>    }
>    UNPROTECT(2);
> }
> =====================
>
> None examples I recall from "Writing R extensions" had void C function
> return types; is that a rule?
>
> I've put the code up here:
> http://arctur.us/r/creshape.tar.gz
>
> I've never tried to make a package before, but you should be able to
> just do a R CMD INSTALL on it, if you want to try it out.  Here's an R
> session using that code:
> > require(creshape)
> Loading required package: creshape
> [1] TRUE
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- creshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 60.756  1.540 62.598  0.000  0.000
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1278.231   78.389 1387.739    0.000    0.000
>
> Any comments are appreciated,
> Mitch
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>