[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows
Prof Brian Ripley
ripley at stats.ox.ac.uk
Tue May 1 14:46:50 CEST 2012
On 01/05/2012 00:28, Antonio Piccolboni wrote:
> Hi,
> I was wondering if there is anything more efficient than split to do the
> kind of conversion in the subject. If I create a data frame as in
>
> system.time({fd = data.frame(x=1:2000, y = rnorm(2000), id = paste("x",
> 1:2000, sep =""))})
> user system elapsed
> 0.004 0.000 0.004
>
> and then I try to split it
>
>> system.time(split(fd, 1:nrow(fd)))
> user system elapsed
> 0.333 0.031 0.415
>
>
> You will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion. Granted, it's not written anywhere

Unsurprising when you create three orders of magnitude more data frames,
is it? That's a list of 2000 data frames. Try

system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id = paste0("x", i)))

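To make the point above concrete, the following sketch (not from the original exchange; the variable names `fd` and `pieces` are only illustrative) checks what split() actually returns for the data frame in the question: one full data frame object per row.

```r
# Rebuild the example data from the question.
fd <- data.frame(x = 1:2000, y = rnorm(2000),
                 id = paste("x", 1:2000, sep = ""))

# Split by row index, as in the question.
pieces <- split(fd, seq_len(nrow(fd)))

# The result is a list with one element per row ...
n_pieces <- length(pieces)

# ... and every element is itself a 1-row data frame, each with its own
# names, row names and class attributes.
all_df   <- all(vapply(pieces, is.data.frame, logical(1)))
all_1row <- all(vapply(pieces, nrow, integer(1)) == 1L)

print(c(n_pieces = n_pieces, all_df = all_df, all_1row = all_1row))
```

So the cost being timed is that of constructing 2000 separate data frames, not of a single conversion.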
> that they should be similar, but the latter seems interpreter-slow to me
> (split is implemented with lapply in the data frame case). There is also a
> memory issue when I hit about 20000 elements (allocating 3 GB when
> interrupted). So before I resort to Rcpp, despite the electrifying feeling
> of approaching the bare metal and for the sake of getting things done, I
> thought I would ask the experts. Thanks

You need to re-think your data structures: 1-row data frames are not
sensible.

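One possible reading of this advice, offered only as a sketch and not as something stated in the thread: if per-row records are really needed, each row can be held as a plain named list instead of a 1-row data frame, which avoids constructing a data frame per row. The names `fd` and `rows` are illustrative, and `stringsAsFactors = FALSE` is an assumption made here so that `id` stays a character vector.

```r
# Example data of the same shape as in the question.
# stringsAsFactors = FALSE (an assumption of this sketch) keeps id as character.
fd <- data.frame(x = 1:2000, y = rnorm(2000),
                 id = paste0("x", 1:2000), stringsAsFactors = FALSE)

# For each row index i, pull the i-th element out of every column.
# lapply() over a data frame iterates over its columns, so each result is
# a plain named list(x = ..., y = ..., id = ...), not a data frame.
rows <- lapply(seq_len(nrow(fd)), function(i) lapply(fd, `[[`, i))

str(rows[[3]])
```

Whether this is appropriate depends on what is done with the rows afterwards; keeping the data in columns and indexing into `fd` directly may well be the better structure.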
>
>
> Antonio
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595