[R] SLOW split() function
William Dunlap
wdunlap at tibco.com
Tue Oct 11 06:31:58 CEST 2011
The following avoids the overhead of data.frame methods
(and assumes the data.frame doesn't include matrices
or other data.frames) and relies on split(vector,factor)
quickly splitting a vector into a list of vectors.
For a 10^6 row by 10 column data.frame split into 10^5
groups, this took 14.1 seconds while split() took 658.7 seconds.
Both returned the same thing.
Perhaps something based on this idea would help your
parallelized by().
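mysplit.data.frame <-
function (x, f, drop = FALSE, ...)
{
    f <- as.factor(f)
    # Split each column separately: a list (per column) of lists (per group).
    tmp <- lapply(x, function(xi) split(xi, f, drop = drop, ...))
    # Split the row names the same way so each piece keeps its own.
    rn <- split(rownames(x), f, drop = drop, ...)
    # Flatten to one list of vectors named by group, then regroup by group name.
    tmp <- unlist(unname(tmp), recursive = FALSE)
    tmp <- split(tmp, factor(names(tmp), levels = unique(names(tmp))))
    # Turn each group's list of column vectors into a data.frame by hand.
    tmp <- lapply(setNames(seq_along(tmp), names(tmp)), function(i) {
        t <- tmp[[i]]
        names(t) <- names(x)
        attr(t, "row.names") <- rn[[i]]
        class(t) <- "data.frame"
        t
    })
    tmp
}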
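
As a quick sanity check of the column-wise idea, the sketch below compares it with split() on a toy data frame. To run on its own it uses a condensed restatement of the same approach under the hypothetical name split_by_column (not the function above verbatim). The data frame d and its values are made up for illustration.

```r
# Condensed restatement of the column-wise idea (hypothetical helper name).
# Assumes x has only plain vector columns (no matrices or nested data.frames).
split_by_column <- function(x, f, drop = FALSE) {
    f <- as.factor(f)
    cols <- lapply(x, split, f, drop = drop)   # per column: list of per-group vectors
    rn <- split(rownames(x), f, drop = drop)   # per group: its row names
    lapply(setNames(seq_along(rn), names(rn)), function(i) {
        g <- lapply(cols, `[[`, i)             # i-th group's piece of every column
        attr(g, "row.names") <- rn[[i]]
        class(g) <- "data.frame"
        g
    })
}

# Toy data, made up for illustration.
d <- data.frame(key = rep(1:3, each = 2),
                val = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5))

fast <- split_by_column(d, d$key)
slow <- split(d, d$key)

# Compare group by group: same column values and same row names.
same <- all(vapply(names(slow), function(k) {
    identical(as.list(fast[[k]]), as.list(slow[[k]])) &&
        identical(rownames(fast[[k]]), rownames(slow[[k]]))
}, logical(1)))
print(same)
```

The comparison goes through as.list() and rownames() rather than identical() on the whole data frames, because the hand-built pieces store their row names as character while split() may keep them as integers.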
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jim Holtman
> Sent: Monday, October 10, 2011 7:29 PM
> To: ivo welch
> Cc: r-help
> Subject: Re: [R] SLOW split() function
>
> instead of splitting the entire data frame, split the indices and then use these to access your data:
> try
>
> system.time(s <- split(seq(nrow(d)), d$key))
>
> this should be faster and less memory intensive. you can then use the indices to access the subset:
>
> result <- lapply(s, function(.indx){
>     doSomething <- sum(d$someCol[.indx])
> })
>
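
For reference, a minimal self-contained sketch of the index-splitting suggestion quoted above. The toy data frame is made up for illustration, and the column name val is borrowed from the original example further down, standing in for the someCol placeholder. It uses seq_len() and vapply() as a matter of taste, not because the quoted code requires them.

```r
# Toy data, made up for illustration.
d <- data.frame(key = rep(1:3, each = 2), val = c(1, 2, 3, 4, 5, 6))

# Split the row indices rather than the data frame itself.
s <- split(seq_len(nrow(d)), d$key)

# Per-group computation that touches only the column it needs.
result <- vapply(s, function(idx) sum(d$val[idx]), numeric(1))
print(result)
```

Only the integer index vector is split here, so no per-group data frames are built unless the computation asks for them.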
> Sent from my iPad
>
> On Oct 10, 2011, at 21:01, ivo welch <ivo.welch at gmail.com> wrote:
>
> > dear R experts: apologies for all my speed and memory questions. I
> > have a bet with my coauthors that I can make R reasonably efficient
> > through R-appropriate programming techniques. this is not just for
> > kicks, but for work. for benchmarking, my [3 year old] Mac Pro has
> > 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
> >
> > right now, it seems that 'split()' is why I am losing my bet. (split
> > is an integral component of *apply() and by(), so I need split() to be
> > fast. its resulting list can then be fed, e.g., to mclapply().) I
> > made up an example to illustrate my ills:
> >
> > library(data.table)
> > N <- 1000
> > T <- N*10
> > d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
> > setkey(d, "key"); gc() ## force a garbage collection
> > cat("N=", N, ". Size of d=", object.size(d)/1024/1024, "MB\n")
> > print(system.time( s<-split(d, d$key) ))
> >
> > My ordered input data table (or data frame; doesn't make a difference)
> > is 114MB in size. it takes about a second to create. split() only
> > needs to reshape it. this simple operation takes almost 5 minutes on
> > my computer.
> >
> > with a data set that is larger, this explodes further.
> >
> > am I doing something wrong? is there an alternative to split()?
> >
> > sincerely,
> >
> > /iaw
> >
> > ----
> > Ivo Welch (ivo.welch at gmail.com)
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>