[R] Splitting data.frame into a list of small data.frames given indices

Wed Jun 29 15:54:20 CEST 2016

Hi,

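I don't really understand why you split on every row: with 1:nrow(dum)
as the grouping, each row becomes its own one-row data.frame, and that
is what makes it so slow. Try a more realistic example, with a factor
that has far fewer levels than rows.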
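
Something along these lines, as a rough sketch only: the column name g, the 1000 groups and the 1e5 rows are assumptions made up for illustration, not taken from your data.

```r
# Sketch: split on a grouping factor with few levels instead of one group per row.
# The sizes and the column name 'g' are illustrative assumptions.
set.seed(1)
n <- 1e5
n_groups <- 1000

dum <- data.frame(
  # Shuffle a balanced assignment so that every group is guaranteed to occur
  g = factor(sample(rep_len(seq_len(n_groups), n))),
  x = runif(n, 1, 1000),
  y = runif(n, 1, 1000)
)

# Split the value columns by the grouping column; the result is a named list
# holding one data.frame per level of g.
timing <- system.time(parts <- split(dum[, c("x", "y")], dum$g))
print(timing)

# Sanity checks: one list element per group, and no rows lost
length(parts) == n_groups
sum(vapply(parts, nrow, integer(1))) == n
```

Repeating this for increasing n while keeping the number of groups fixed should show how the timing grows when the number of pieces does not grow with the number of rows.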

Ivan

--
Ivan Calandra, PhD
Scientific Mediator
University of Reims Champagne-Ardenne
GEGENAA - EA 3795
CREA - 2 esplanade Roland Garros
51100 Reims, France
+33(0)3 26 77 36 89
ivan.calandra at univ-reims.fr
--
https://www.researchgate.net/profile/Ivan_Calandra
https://publons.com/author/705639/

On 29/06/2016 at 15:21, Witold E Wolski wrote:
> Hi,
>
> Here is a complete example which shows that the complexity of split or
> by is O(n^2)
>
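> # Data sizes to benchmark
> nrows <- c(1e3, 5e3, 1e4, 5e4, 1e5, 2e5)
> res <- list()
>
> for (i in nrows) {
>    dum <- data.frame(x = runif(i, 1, 1000), y = runif(i, 1, 1000))
>    # Time splitting into one data.frame per row
>    res[[length(res) + 1]] <- system.time(x <- split(dum, 1:nrow(dum)))
> }
> res <- do.call("rbind", res)
> # A straight line here would indicate quadratic growth in elapsed time
> plot(nrows^2, res[, "elapsed"])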
>
> And I can't see a reason why this has to be so slow.
>
> cheers
>
> On 29 June 2016 at 12:00, Rolf Turner <r.turner at auckland.ac.nz> wrote:
>> On 29/06/16 21:16, Witold E Wolski wrote:
>>> It's the inverse problem to merging a list of data.frames into a large
>>> data.frame, just discussed in the "performance of do.call("rbind")"
>>> thread.
>>>
>>> I would like to split a data.frame into a list of data.frames
>>> according to the first column.
>>> This SEEMS to be easily possible with the function base::by. However,
>>> as soon as the data.frame has a few million rows this function CANNOT
>>> BE USED (unless you have PLENTY OF TIME).
>>>
>>> For 'by', runtime ~ nrow^2, or formally O(n^2) (see benchmark below).
>>>
>>> So basically I am looking for a similar function with better complexity.
>>>
>>>
>>> > nrows <- c(1e5,1e6,2e6,3e6,5e6)
>>> > timing <- list()
>>> > for(i in nrows){
>>> + dum <- peaks[1:i,]
>>> + timing[[length(timing)+1]] <- system.time(x<- by(dum[,2:3],
>>> + INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
>>> + }
>>> > names(timing)<- nrows
>>> > timing
>>> $`1e+05`
>>>     user  system elapsed
>>>     0.05    0.00    0.05
>>>
>>> $`1e+06`
>>>     user  system elapsed
>>>     1.48    2.98    4.46
>>>
>>> $`2e+06`
>>>     user  system elapsed
>>>     7.25   11.39   18.65
>>>
>>> $`3e+06`
>>>     user  system elapsed
>>>    16.15   25.81   41.99
>>>
>>> $`5e+06`
>>>     user  system elapsed
>>>    43.22   74.72  118.09
>>
>> I'm not sure that I follow what you're doing, and your example is not
>> reproducible, since we have no idea what "peaks" is, but on a toy example
>> with 5e6 rows in the data frame I got a timing result of
>>
>>     user  system elapsed
>>    0.379   0.025   0.406
>>
>> when I applied split().  Is this adequately fast? Seems to me that if you
>> want to split something, split() would be a good place to start.
>>
>> cheers,
>>
>> Rolf Turner
>>
>> --
>> Technical Editor ANZJS
>> Department of Statistics
>> University of Auckland
>> Phone: +64-9-373-7599 ext. 88276
>
>