[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

Tue May 1 14:46:50 CEST 2012

On 01/05/2012 00:28, Antonio Piccolboni wrote:
> Hi,
> I was wondering if there is anything more efficient than split to do the
> kind of conversion in the subject. If I create a data frame as in
>
> system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x", 1:2000, sep = ""))})
>    user  system elapsed
>    0.004   0.000   0.004
>
> and then I try to split it
>
>> system.time(split(fd, 1:nrow(fd)))
>     user  system elapsed
>    0.333   0.031   0.415
>
>
> You will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion. Granted, it's not written anywhere

Unsurprising when you create three orders of magnitude more data frames, 
is it?  That's a list of 2000 data frames.  Try

system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id = paste0("x", i)))
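
To put the two costs side by side, here is a minimal sketch that times split() against building the same number of one-row data frames in a loop. The names n, parts, t_split and t_loop are illustrative and not from the original post; timings vary by machine and R version, so none are shown.

```r
n <- 2000
fd <- data.frame(x = seq_len(n), y = rnorm(n), id = paste0("x", seq_len(n)))

# Split into n one-row data frames, as in the original question.
t_split <- system.time(parts <- split(fd, seq_len(n)))

# Construct n one-row data frames directly, as suggested in the reply.
t_loop <- system.time(
  for (i in seq_len(n)) data.frame(x = i, y = rnorm(1), id = paste0("x", i))
)

print(t_split)
print(t_loop)
```

If the two timings come out in the same ballpark, most of the cost is the construction of many small data frames rather than anything specific to split().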

> that they should be similar but the latter seems interpreter-slow to me
> (split is implemented with lapply in the data frame case). There is also a
> memory issue when I hit about 20000 elements (allocating 3GB when
> interrupted). So before I resort to Rcpp, despite the electrifying feeling
> of approaching the bare metal and for the sake of getting things done, I
> thought I would ask the experts. Thanks

You need to re-think your data structures: 1-row data frames are not 
sensible.
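
One possible alternative structure, sketched under the assumption that each row is only needed as a record of scalar values: a plain list per row rather than a one-row data frame. The name rows is hypothetical, and stringsAsFactors = FALSE is set explicitly so that id stays a character column regardless of R version.

```r
fd <- data.frame(x = 1:2000, y = rnorm(2000),
                 id = paste0("x", 1:2000), stringsAsFactors = FALSE)

# One plain named list per row; each element is a length-1 vector.
# This avoids the per-object overhead of creating 2000 data frames.
rows <- lapply(seq_len(nrow(fd)), function(i) lapply(fd, `[[`, i))

# Access works much as it would on a one-row data frame.
rows[[3]]$id
```

Whether this fits depends on what the downstream code expects; if it can work on columns directly, keeping fd whole and indexing by row number avoids the conversion entirely.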

>
>
> Antonio

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595