[Rd] reshape scaling with large numbers of times/rows
Mitch Skinner
lists at arctur.us
Thu Aug 24 08:06:03 CEST 2006
Hello,
I'm mailing r-devel because I think the performance problem I'm having
is best solved by re-writing reshape() in C. I've been reading the
"writing R extensions" documentation and I have some questions about
they best way to write the C bits, but first let me describe my problem:
I'm trying to reshape a long data frame with ~70 subjects measured at
~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into
a wide data frame with one column per locus ("time"). On my data
(~300,000 rows for chromosome 1) it takes about twenty minutes on a
3.4GHz P4. Here's an R session that demonstrates it (this is pretty
close to how my data actually looks):
> version
_
platform i686-redhat-linux-gnu
arch i686
os linux-gnu
system i686, linux-gnu
status
major 2
minor 3.1
year 2006
month 06
day 01
svn rev 38247
language R
version.string Version 2.3.1 (2006-06-01)
> test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> system.time(testWide <- reshape(test, v.names=c("genotype"),
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1107.506 120.755 1267.568 0.000 0.000
I believe that this can be done a lot faster, and I think the problem
comes from this loop in reshape.R (in reshapeWide):
for(i in seq(length = length(times))) {
thistime <- data[data[,timevar] %in% times[i],]
rval[,varying[,i]] <-
thistime[match(rval[,idvar],thistime[,idvar]),
v.names]
}
I don't really understand everything going on under the hood here, but I
believe the complexity of this loop is something like
O(length(times)*(nrow(data)+(nrow(rval)*length(varying))). The profile
shows the lion's share (90%) of the time being spent in [.data.frame.
What I'd like to do is write a C loop to go through the source (long)
data frame once (rather than once per time), and put the values into the
destination rows/columns using hash tables to look up the right
row/column.
Assuming the hashing is constant-time bounded, then the reshape becomes
O(nrow(data)*length(varying)).
I'd like to use the abitrary-R-vector hashing functionality from
unique.c, but I'm not sure how to do it. I think that functionality is
useful to other C code, but the functions that I'm interested in are not
part of the R api (they're defined with "static"). Assuming copy/paste
is out, I can see two options: 1. to remove "static" from the
declarations of some of the functions, and add prototypes for those
functions to a new src/library/stats/src/reshape.c file (or to one of
the header files in src/include), or 2. to add C functions to do the
reshaping to src/main/unique.c and call those from
src/library/stats/R/reshape.R.
This is all assuming that it makes sense to include this in mainline
R--obviously I think it's worthwhile, and I'm surprised other people
aren't complaining more. I would be happy to write/rewrite until y'all
are happy with how it's done, of course.
I've written a proof-of-concept by copying and pasting the hashing
functions, which (on the above data frame) runs 20 times faster than the
R version of reshape. It still needs some debugging, to put it mildly
(doesn't work properly on reals), but the basic idea appears to work.
The change I made to the reshape R function looks like this:
=====================
for(i in seq(length = length(times))) {
- thistime <- data[data[,timevar] %in% times[i],]
- rval[,varying[,i]] <-
thistime[match(rval[,idvar],thistime[,idvar]),
- v.names]
+ for (j in seq(length(v.names))) {
+ rval[,varying[j,i]] <-
rep(vector(mode=typeof(data[[v.names[j]]]), 0),
+ length.out=nrow(rval))
+ }
}
+ colMap <- match(varying, names(rval))
+ .Call("do_reshape_wide",
+ data[[idvar]], data[[timevar]], data[v.names],
+ rval, colMap,
+ v.names, times, rval[[idvar]])
+
if (!is.null(new.row.names))
row.names(rval) <- new.row.names
=====================
This part:
rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval))
is to initialize the output with appropriately-typed vectors full of
NAs; if there's a better/right way to do this please let me know.
The do_reshape_wide C function looks like this:
=====================
SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData,
SEXP wideFrame, SEXP colMap,
SEXP vnames, SEXP times, SEXP wideIds) {
HashData idHash, timeHash;
int longRows, numVarying;
int rowNum, varying;
int wideRow, curTime;
SEXP wideCol, longCol;
HashTableSetup(wideIds, &idHash);
PROTECT(idHash.HashTable);
DoHashing(wideIds, &idHash);
HashTableSetup(times, &timeHash);
PROTECT(timeHash.HashTable);
DoHashing(times, &timeHash);
longRows = length(longIds);
numVarying = length(vnames);
for (rowNum = 0; rowNum < longRows; rowNum++) {
/* Lookup returns 1-based answers */
wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1;
curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1;
for (varying = 0; varying < numVarying; varying++) {
/* colMap is 1-based */
wideCol = VECTOR_ELT(wideFrame,
INTEGER(colMap)[(numVarying * curTime)
+ varying] - 1);
longCol = VECTOR_ELT(longData, varying);
SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol,
rowNum));
}
}
UNPROTECT(2);
}
=====================
None examples I recall from "Writing R extensions" had void C function
return types; is that a rule?
I've put the code up here:
http://arctur.us/r/creshape.tar.gz
I've never tried to make a package before, but you should be able to
just do a R CMD INSTALL on it, if you want to try it out. Here's an R
session using that code:
> require(creshape)
Loading required package: creshape
[1] TRUE
> test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) +
1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))),
genotype=I(as.character(as.integer(runif(3e5, 1, 100)))))
> system.time(testWide <- creshape(test, v.names=c("genotype"),
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 60.756 1.540 62.598 0.000 0.000
> system.time(testWide <- reshape(test, v.names=c("genotype"),
timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1278.231 78.389 1387.739 0.000 0.000
Any comments are appreciated,
Mitch
More information about the R-devel
mailing list