[Rd] reshape scaling with large numbers of times/rows
Gabor Grothendieck
ggrothendieck at gmail.com
Thu Aug 24 09:16:02 CEST 2006
If for each combination of X and Y there is at most one Z and Z is numeric
(it could be made so in your example) then you could use xtabs which is
faster:
> m <- n <- 10
> DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = 10*(1:(n*m)))
> system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide"))
[1] 0.33 0.00 0.33 NA NA
> system.time(w2 <- xtabs(Z ~ Y + X, DF))
[1] 0.01 0.00 0.01 NA NA
On 8/24/06, Mitch Skinner <lists at arctur.us> wrote:
> Hello,
>
> I'm mailing r-devel because I think the performance problem I'm having
> is best solved by re-writing reshape() in C. I've been reading the
> "writing R extensions" documentation and I have some questions about
> they best way to write the C bits, but first let me describe my problem:
>
> I'm trying to reshape a long data frame with ~70 subjects measured at
> ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
> a wide data frame with one column per locus ("time"). On my data
> (~300,000 rows for chromosome 1) it takes about twenty minutes on a
> 3.4GHz P4. Here's an R session that demonstrates it (this is pretty
> close to how my data actually looks):
>
> > version
> _
> platform i686-redhat-linux-gnu
> arch i686
> os linux-gnu
> system i686, linux-gnu
> status
> major 2
> minor 3.1
> year 2006
> month 06
> day 01
> svn rev 38247
> language R
> version.string Version 2.3.1 (2006-06-01)
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1107.506 120.755 1267.568 0.000 0.000
>
> I believe that this can be done a lot faster, and I think the problem
> comes from this loop in reshape.R (in reshapeWide):
>
> for(i in seq(length = length(times))) {
> thistime <- data[data[,timevar] %in% times[i],]
> rval[,varying[,i]] <-
> thistime[match(rval[,idvar],thistime[,idvar]),
> v.names]
> }
>
> I don't really understand everything going on under the hood here, but I
> believe the complexity of this loop is something like
> O(length(times)*(nrow(data)+(nrow(rval)*length(varying))). The profile
> shows the lion's share (90%) of the time being spent in [.data.frame.
>
> What I'd like to do is write a C loop to go through the source (long)
> data frame once (rather than once per time), and put the values into the
> destination rows/columns using hash tables to look up the right
> row/column.
>
> Assuming the hashing is constant-time bounded, then the reshape becomes
> O(nrow(data)*length(varying)).
>
> I'd like to use the abitrary-R-vector hashing functionality from
> unique.c, but I'm not sure how to do it. I think that functionality is
> useful to other C code, but the functions that I'm interested in are not
> part of the R api (they're defined with "static"). Assuming copy/paste
> is out, I can see two options: 1. to remove "static" from the
> declarations of some of the functions, and add prototypes for those
> functions to a new src/library/stats/src/reshape.c file (or to one of
> the header files in src/include), or 2. to add C functions to do the
> reshaping to src/main/unique.c and call those from
> src/library/stats/R/reshape.R.
>
> This is all assuming that it makes sense to include this in mainline
> R--obviously I think it's worthwhile, and I'm surprised other people
> aren't complaining more. I would be happy to write/rewrite until y'all
> are happy with how it's done, of course.
>
> I've written a proof-of-concept by copying and pasting the hashing
> functions, which (on the above data frame) runs 20 times faster than the
> R version of reshape. It still needs some debugging, to put it mildly
> (doesn't work properly on reals), but the basic idea appears to work.
>
> The change I made to the reshape R function looks like this:
> =====================
> for(i in seq(length = length(times))) {
> - thistime <- data[data[,timevar] %in% times[i],]
> - rval[,varying[,i]] <-
> thistime[match(rval[,idvar],thistime[,idvar]),
> - v.names]
> + for (j in seq(length(v.names))) {
> + rval[,varying[j,i]] <-
> rep(vector(mode=typeof(data[[v.names[j]]]), 0),
> + length.out=nrow(rval))
> + }
> }
>
> + colMap <- match(varying, names(rval))
> + .Call("do_reshape_wide",
> + data[[idvar]], data[[timevar]], data[v.names],
> + rval, colMap,
> + v.names, times, rval[[idvar]])
> +
> if (!is.null(new.row.names))
> row.names(rval) <- new.row.names
> =====================
>
> This part:
> rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval))
> is to initialize the output with appropriately-typed vectors full of
> NAs; if there's a better/right way to do this please let me know.
>
> The do_reshape_wide C function looks like this:
> =====================
> SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData,
> SEXP wideFrame, SEXP colMap,
> SEXP vnames, SEXP times, SEXP wideIds) {
> HashData idHash, timeHash;
> int longRows, numVarying;
> int rowNum, varying;
> int wideRow, curTime;
> SEXP wideCol, longCol;
>
> HashTableSetup(wideIds, &idHash);
> PROTECT(idHash.HashTable);
> DoHashing(wideIds, &idHash);
>
> HashTableSetup(times, &timeHash);
> PROTECT(timeHash.HashTable);
> DoHashing(times, &timeHash);
>
> longRows = length(longIds);
> numVarying = length(vnames);
>
> for (rowNum = 0; rowNum < longRows; rowNum++) {
> /* Lookup returns 1-based answers */
> wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1;
> curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1;
>
> for (varying = 0; varying < numVarying; varying++) {
> /* colMap is 1-based */
> wideCol = VECTOR_ELT(wideFrame,
> INTEGER(colMap)[(numVarying * curTime)
> + varying] - 1);
> longCol = VECTOR_ELT(longData, varying);
>
> SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol,
> rowNum));
> }
> }
> UNPROTECT(2);
> }
> =====================
>
> None examples I recall from "Writing R extensions" had void C function
> return types; is that a rule?
>
> I've put the code up here:
> http://arctur.us/r/creshape.tar.gz
>
> I've never tried to make a package before, but you should be able to
> just do a R CMD INSTALL on it, if you want to try it out. Here's an R
> session using that code:
> > require(creshape)
> Loading required package: creshape
> [1] TRUE
> > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
> 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
> genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> > system.time(testWide <- creshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 60.756 1.540 62.598 0.000 0.000
> > system.time(testWide <- reshape(test, v.names=c("genotype"),
> timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
> [1] 1278.231 78.389 1387.739 0.000 0.000
>
> Any comments are appreciated,
> Mitch
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
More information about the R-devel
mailing list