[Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

Simon Urbanek simon.urbanek at r-project.org
Tue May 1 20:29:18 CEST 2012


On May 1, 2012, at 1:26 PM, Antonio Piccolboni <antonio at piccolboni.info> wrote:

> It seems like people need to hear more context; happy to provide it. I am
> implementing a serialization format (typedbytes, HADOOP-1722 if people want
> the gory details) to make R and Hadoop interoperate better (RHadoop
> project, package rmr). It is a row-first format, and it is already
> implemented as a C extension for R for lists and atomic vectors, where each
> element of a vector is a row. I need to extend it to accept data frames,
> and I was wondering if I could reuse the existing C code by converting a
> data frame to a list of its rows. It sounds like the answer is that it is
> not a good idea,

Just think about it -- data frames are lists of *columns* because the type of each column is fixed. Treating them row-wise is extremely inefficient, because you can't use any vector type to represent such a thing (other than a generic vector containing vectors of length 1).
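
For illustration, a minimal sketch (not part of the original message) of what a single "row" has to become once it is pulled out of the column-wise representation:

fd <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5), id = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

## No atomic vector can hold an integer, a double and a character at the
## same time, so the only faithful "row" is a generic list of length-1
## vectors:
row1 <- lapply(fd, `[[`, 1)
str(row1)
## List of 3
##  $ x : int 1
##  $ y : num 1.5
##  $ id: chr "a"

Doing that for every row means one generic vector plus one length-1 vector per column, per row, which is where the inefficiency comes from.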


> that's helpful too, in a way, because it restricts the options. I thought
> I might be missing a simple primitive, like a t() for data frames (one
> that doesn't coerce to a matrix).

See above -- I think you are misunderstanding data frames; t() makes no sense for a data frame.
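
To see the coercion concretely, a small sketch (added for illustration): t() goes through as.matrix(), and a matrix can have only one type, so mixed columns collapse to character:

fd <- data.frame(x = 1:2, y = c(1.5, 2.5), id = c("a", "b"),
                 stringsAsFactors = FALSE)

tfd <- t(fd)      ## implicitly t(as.matrix(fd))
typeof(tfd)       ## "character" -- the column types are gone
tfd["y", 1]       ## "1.5", a string, not a number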

Cheers,
Simon



> On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
> 
>> On 01/05/2012 00:28, Antonio Piccolboni wrote:
>> 
>>> Hi,
>>> I was wondering if there is anything more efficient than split to do the
>>> kind of conversion in the subject. If I create a data frame as in
>>> 
>>> system.time({fd = data.frame(x = 1:2000, y = rnorm(2000),
>>>                              id = paste("x", 1:2000, sep = ""))})
>>>  user  system elapsed
>>>  0.004   0.000   0.004
>>> 
>>> and then I try to split it
>>> 
>>> system.time(split(fd, 1:nrow(fd)))
>>>   user  system elapsed
>>>  0.333   0.031   0.415
>>> 
>>> 
>>> You will be quick to notice the roughly two orders of magnitude difference
>>> in time between creation and conversion. Granted, it's not written
>>> anywhere
>>> 
>> 
>> Unsurprising when you create three orders of magnitude more data frames,
>> isn't it?  That's a list of 2000 data frames.  Try
>> 
>> system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
>> paste0("x", i)))
>> 
>> 
>> 
>>> that they should be similar, but the latter seems interpreter-slow to me
>>> (split is implemented with an lapply in the data-frame case). There is
>>> also a memory issue when I hit about 20000 elements (allocating 3GB when
>>> interrupted). So before I resort to Rcpp, despite the electrifying
>>> feeling of approaching the bare metal and for the sake of getting things
>>> done, I thought I would ask the experts. Thanks
>>> 
>> 
>> You need to re-think your data structures: 1-row data frames are not
>> sensible.
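
If a list of rows is genuinely needed (as in the row-first serialization format that started this thread), one pure-R idiom that avoids materializing thousands of 1-row data frames is to Map() list() across the columns. A minimal sketch, added for illustration rather than taken from the thread:

fd <- data.frame(x = 1:2000, y = rnorm(2000),
                 id = paste("x", 1:2000, sep = ""),
                 stringsAsFactors = FALSE)

## Map() walks the columns in lock-step, so row i becomes a plain named
## list, list(x = ..., y = ..., id = ...); no data.frame() call is made
## per row, avoiding the per-row overhead discussed above.
rows <- do.call(Map, c(list, fd))

length(rows)   ## 2000
str(rows[[1]]) ## list with $x (int), $y (num), $id (chr)

Each element is an ordinary list rather than a 1-row data frame, which also matches what a row-first serializer needs.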
>> 
>> 
>> 
>>> 
>>> Antonio
>>> 
>> 
>> 
>> --
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>> 
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 


