[R] SLOW split() function

Tue Oct 11 06:31:58 CEST 2011

The following avoids the overhead of the data.frame methods
(it assumes the data.frame doesn't include matrices
or other data.frames) and relies on split(vector, factor)
quickly splitting a vector into a list of vectors.
For a 10^6 row by 10 column data.frame split into 10^5
groups this took 14.1 s while split() took 658.7 s.
Both returned the same thing.

Perhaps something based on this idea would help your
parallelized by().

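mysplit.data.frame <-
function (x, f, drop = FALSE, ...)
{
    f <- as.factor(f)
    # Split every column as a plain vector: one list of per-group pieces per column.
    tmp <- lapply(x, function(xi) split(xi, f, drop = drop, ...))
    # Split the row names the same way.
    rn <- split(rownames(x), f, drop = drop, ...)
    # Flatten into a single list whose names are the group labels.
    tmp <- unlist(unname(tmp), recursive = FALSE)
    # Regroup so that each element holds all the column pieces of one group.
    tmp <- split(tmp, factor(names(tmp), levels = unique(names(tmp))))
    # Assemble each group's list of columns into a data.frame by hand.
    tmp <- lapply(setNames(seq_along(tmp), names(tmp)), function(i) {
        piece <- tmp[[i]]
        names(piece) <- names(x)
        attr(piece, "row.names") <- rn[[i]]
        class(piece) <- "data.frame"
        piece
    })
    tmp
}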
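
To see what each step produces, here is a minimal sketch of the same
idea on a toy data.frame. The names d, f, cols, rn and pieces are made
up for illustration and are not part of the function above. It compares
column contents and row names with split() rather than calling
identical() on the whole result, since the internal storage of the
row names may differ between the two.

```r
# Toy data: 6 rows, 2 columns, 3 groups (all names here are illustrative).
d <- data.frame(a = 1:6, b = letters[1:6], stringsAsFactors = FALSE)
f <- factor(c("x", "y", "x", "z", "y", "z"))

# Step 1: split each column as a plain vector.
# cols$a is a list with one integer vector per group, likewise cols$b.
cols <- lapply(d, function(xi) split(xi, f))

# Step 2: split the row names by the same factor.
rn <- split(rownames(d), f)

# Step 3: for each group, collect that group's piece of every column
# and set the data.frame attributes directly.
pieces <- lapply(setNames(levels(f), levels(f)), function(g) {
    piece <- lapply(cols, `[[`, g)
    attr(piece, "row.names") <- rn[[g]]
    class(piece) <- "data.frame"
    piece
})

# Compare with the stock method, column by column and row name by row name.
ref <- split(d, f)
same_columns  <- identical(lapply(pieces, as.list), lapply(ref, as.list))
same_rownames <- identical(lapply(pieces, rownames), lapply(ref, rownames))
```

Both same_columns and same_rownames should come out TRUE here. The
by-hand assembly in step 3 is where the time is saved, because it never
goes through the data.frame subsetting method.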

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jim Holtman
> Sent: Monday, October 10, 2011 7:29 PM
> To: ivo welch
> Cc: r-help
> Subject: Re: [R] SLOW split() function
> 
> instead of splitting the entire data frame, split the indices and then use these to access your data:
> try
> 
> system.time(s <- split(seq_len(nrow(d)), d$key))
> 
> this should be faster and less memory intensive.  you can then use the indices to access the subset:
> 
> result <- lapply(s, function(.indx){
>     doSomething <- sum(d$someCol[.indx])
> })
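
A small self-contained version of this index approach, as a sketch only:
the data frame d and its columns key and someCol are invented for
illustration, and sum() stands in for whatever per-group work is needed.

```r
# Hypothetical toy data: 3 groups of 4 rows each.
d <- data.frame(key = rep(1:3, each = 4), someCol = as.numeric(1:12))

# Split the row indices, not the data frame itself.
s <- split(seq_len(nrow(d)), d$key)

# Use each group's indices to pull out just the column that is needed.
result <- vapply(s, function(idx) sum(d$someCol[idx]), numeric(1))
```

Here s is a list of integer vectors named by key value, and result is a
named numeric vector with one sum per group. Only the index vectors are
copied, so no per-group data frames are ever built.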
> 
> On Oct 10, 2011, at 21:01, ivo welch <ivo.welch at gmail.com> wrote:
> 
> > dear R experts:  apologies for all my speed and memory questions.  I
> > have a bet with my coauthors that I can make R reasonably efficient
> > through R-appropriate programming techniques.  this is not just for
> > kicks, but for work.  for benchmarking, my [3 year old] Mac Pro has
> > 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
> >
> > right now, it seems that 'split()' is why I am losing my bet.  (split
> > is an integral component of *apply() and by(), so I need split() to be
> > fast.  its resulting list can then be fed, e.g., to mclapply().)  I
> > made up an example to illustrate my ills:
> >
> >    library(data.table)
> >    N <- 1000
> >    T <- N*10
> >    d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
> >    setkey(d, "key"); gc() ## force a garbage collection
> >    cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
> >    print(system.time( s<-split(d, d$key) ))
> >
> > My ordered input data table (or data frame; doesn't make a difference)
> > is 114MB in size.  it takes about a second to create.  split() only
> > needs to reshape it.  this simple operation takes almost 5 minutes on
> > my computer.
> >
> > with a data set that is larger, this explodes further.
> >
> > am I doing something wrong?  is there an alternative to split()?
> >
> > sincerely,
> >
> > /iaw
> >
> > ----
> > Ivo Welch (ivo.welch at gmail.com)
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.