[R] SLOW split() function

Tue Oct 11 06:56:02 CEST 2011

thank you, everyone.  this was very helpful, both for my specific task
and for my understanding.  for the benefit of future googlers, I thought
I would post some experiments and results here.

ultimately, I need to do a by() on an irregular matrix, and I now know
how to speed up by() on a single core, and then again on a multi-core
machine.

library(data.table)
N <- 1000*1000
d <- data.table(data.frame( key= as.integer(runif(N, min=1, max=N/10)),
                            x=rnorm(N), y=rnorm(N) ))  # irregular
setkey(d, "key"); gc() ## sort and force a garbage collection

cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")

cat("\nStandard by() Function:\n")
print(system.time( all.1 <- by( d, d$key, function(d) coef(lm(y ~ x, data=d)))))

cat("\n\nPreSplit Function [aka Jim H]\n\t(a) Splitting Operation:\n")
print(system.time(si <- split(seq(nrow(d)), d$key)))
cat("\n\t(b) Regressions:\n")
print(system.time(all.2 <- lapply(si, function(.indx) {
  coef(lm(d$y[.indx] ~ d$x[.indx])) })))
print(system.time(all.2b <- lapply(si, function(.indx) {
  coef(lm(y ~ x, data=d[.indx,])) })))

cat("\n\nNaive Split Data Frame\n\t(a) Splitting Operation:\n")
print(system.time(ds <- split(d, d$key)))
cat("\n\t(b) Regressions:\n")
print(system.time(all.3a <- lapply(ds, function(ds) { coef(lm(ds$y ~ ds$x)) })))
print(system.time(all.3b <- lapply(ds, function(ds) {
  coef(lm(y ~ x, data=ds)) })))

the first and the last approaches (all.1 and all.3a/all.3b) are the
"naive" ways of doing this, and take about 400-500 seconds on a MacBook
Air with a Core i5.  Jim's suggestion (all.2) cuts this roughly in half
by reducing the time for the split to almost nothing.
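
the equivalence of the two routes can be checked on a small scale before running the full benchmark.  what follows is only a minimal sketch: it assumes a plain data.frame instead of a data.table, a much smaller number of rows, and a fixed seed, and the names small, idx, fit.idx and fit.df are made up for this illustration.

```r
set.seed(1)                       # fixed seed (an assumption, for reproducibility)
n <- 2000                         # far smaller than N above, just a quick check
small <- data.frame(key = sample(1:50, n, replace = TRUE),
                    x = rnorm(n), y = rnorm(n))

# split the row numbers, not the data frame itself
idx <- split(seq_len(nrow(small)), small$key)

# regressions on the index subsets
fit.idx <- lapply(idx, function(.indx) coef(lm(y ~ x, data = small[.indx, ])))

# naive version: split the whole data frame first
fit.df <- lapply(split(small, small$key),
                 function(ds) coef(lm(y ~ x, data = ds)))

# compare the two lists of coefficients, group by group
print(all.equal(fit.idx, fit.df))
```

both lists hold one coefficient vector per key, so only the cost of the splitting step differs between the two routes.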

and now,

library(multicore)
print(system.time(all.4 <- mclapply(si, function(.indx) {
  coef(lm(y ~ x, data=d[.indx,])) })))

on my dual-core (quad-thread) i5, all four pseudo cores become busy,
and the time roughly halves again from 230 seconds to 120 seconds.

maybe the by() function should use Jim's approach, and multicore
should provide mcby().  of course, now that I know how to do this
quickly by hand, this is not so important for me.  but it may help
other novices.
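
to make the mcby() idea concrete, here is one way such a helper might look.  this is a sketch under stated assumptions, not an existing function: mcby() is a hypothetical name, the input is assumed to be a plain data.frame, and mc.cores = 2 is an arbitrary choice.  it uses the parallel package that is bundled with R (from R 2.14.0), which provides an mclapply() derived from multicore.

```r
library(parallel)   # bundled with R; provides mclapply()

# hypothetical helper, not part of any package: a by()-like function that
# splits row indices (Jim's approach) and applies FUN to each group of rows
# through mclapply()
mcby <- function(data, INDICES, FUN, ..., mc.cores = 1L) {
  idx <- split(seq_len(nrow(data)), INDICES)
  mclapply(idx, function(.indx) FUN(data[.indx, , drop = FALSE], ...),
           mc.cores = mc.cores)
}

set.seed(1)
small <- data.frame(key = sample(1:50, 2000, replace = TRUE),
                    x = rnorm(2000), y = rnorm(2000))

# forking is not available on Windows, so fall back to one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

res <- mcby(small, small$key, function(g) coef(lm(y ~ x, data = g)),
            mc.cores = cores)
str(head(res, 3))
```

unlike by(), this sketch returns a plain named list rather than an object of class "by", and it handles only a single grouping vector.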

thanks again everybody.

regards,

/iaw

----
Ivo Welch (ivo.welch at gmail.com)

On Mon, Oct 10, 2011 at 9:31 PM, William Dunlap <wdunlap at tibco.com> wrote:
> The following avoids the overhead of data.frame methods
> (and assumes the data.frame doesn't include matrices
> or other data.frames) and relies on split(vector,factor)
> quickly splitting a vector into a list of vectors.
> For a 10^6 row by 10 column data.frame split in 10^5
> groups this took 14.1 seconds while split took 658.7 s.
> Both returned the same thing.
>
> Perhaps something based on this idea would help your
> parallelized by().
>
> mysplit.data.frame <-
> function (x, f, drop = FALSE, ...)
> {
>    f <- as.factor(f)
>    tmp <- lapply(x, function(xi) split(xi, f, drop = drop, ...))
>    rn <- split(rownames(x), f, drop = drop, ...)
>    tmp <- unlist(unname(tmp), recursive = FALSE)
>    tmp <- split(tmp, factor(names(tmp), levels = unique(names(tmp))))
>    tmp <- lapply(setNames(seq_along(tmp), names(tmp)), function(i) {
>        t <- tmp[[i]]
>        names(t) <- names(x)
>        attr(t, "row.names") <- rn[[i]]
>        class(t) <- "data.frame"
>        t
>    })
>    tmp
> }
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jim Holtman
>> Sent: Monday, October 10, 2011 7:29 PM
>> To: ivo welch
>> Cc: r-help
>> Subject: Re: [R] SLOW split() function
>>
>> instead of splitting the entire data frame, split the indices and then use these to access your data:
>> try
>>
>> system.time(s <- split(seq(nrow(d)), d$key))
>>
>> this should be faster and less memory intensive.  you can then use the indices to access the subset:
>>
>> result <- lapply(s, function(.indx){
>>     doSomething <- sum(d$someCol[.indx])
>> })
>>
>> On Oct 10, 2011, at 21:01, ivo welch <ivo.welch at gmail.com> wrote:
>>
>> > dear R experts:  apologies for all my speed and memory questions.  I
>> > have a bet with my coauthors that I can make R reasonably efficient
>> > through R-appropriate programming techniques.  this is not just for
>> > kicks, but for work.  for benchmarking, my [3 year old] Mac Pro has
>> > 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
>> >
>> > right now, it seems that 'split()' is why I am losing my bet.  (split
>> > is an integral component of *apply() and by(), so I need split() to be
>> > fast.  its resulting list can then be fed, e.g., to mclapply().)  I
>> > made up an example to illustrate my ills:
>> >
>> >    library(data.table)
>> >    N <- 1000
>> >    T <- N*10
>> >    d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
>> >    setkey(d, "key"); gc() ## force a garbage collection
>> >    cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
>> >    print(system.time( s<-split(d, d$key) ))
>> >
>> > My ordered input data table (or data frame; doesn't make a difference)
>> > is 114MB in size.  it takes about a second to create.  split() only
>> > needs to reshape it.  this simple operation takes almost 5 minutes on
>> > my computer.
>> >
>> > with a data set that is larger, this explodes further.
>> >
>> > am I doing something wrong?  is there an alternative to split()?
>> >
>> > sincerely,
>> >
>> > /iaw
>> >
>> > ----
>> > Ivo Welch (ivo.welch at gmail.com)
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>