[R] problem of data manipulation

Wed Jan 20 14:48:31 CET 2010

The user wrote in their first post :
> I have a lot of observations in my dataset

Heres one way to do it with a data.table :
         a=data.table(a)
         ans = a[ , list(dt=dt[dt-min(dt)<7]) , by="var1,var2,var3"]
         class(ans$dt) = "Date"

Timings are below comparing the 3 methods. In this example, data.table 
appears to be 28 times faster than plyr, and 24 times faster than sqldf.  I 
excluded the one off time to build the key, since thats realistic, but even 
including that time, data.table is still 16 times faster than plyr (134 / 
(1.03+2.16+4.71)).  With even more rows, it should be even bigger speedups.

> a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = 
> c("c",
"n", "s"), class = "factor"), var2 = c(1L, 1L, 1L, 2L, 2L, 2L), var3 = c(2L,
2L, 2L, 1L, 1L, 1L), dt = structure(c(10592, 10997, 11000, 10998,
11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3",
"dt"), row.names = c(NA, -6L), class = "data.frame")
>
> a = data.frame(lapply(a,function(x)rep(x,each=1000000)))
> dim(a)
[1] 6000000       4
> library(plyr)
> system.time({ans1 <<- ddply(a, c("var1", "var2", "var3"), subset, dt - 
> min(dt) < 7)})
   user  system elapsed
 131.39    3.11  134.80
> library(sqldf)
> system.time({ans2 <<- sqldf("select var1, var2, var3, dt from a, (select 
> var1, var2, var3, min(dt) mindt from a group by var1, var2, var3) 
> using(var1, var2, var3) where dt - mindt < 7")})
   user  system elapsed
 110.26    2.24  113.32
> mapply(identical,ans1,ans2[order(ans2$var1),])
var1 var2 var3   dt
TRUE TRUE TRUE TRUE
>
> library(data.table)
> system.time({adt<<-data.table(a)})
   user  system elapsed
   0.90    0.13    1.03
> system.time({setkey(adt,var1,var2,var3)})
   user  system elapsed
   1.89    0.27    2.16
> system.time({ans3 <<- 
> adt[,list(dt=dt[dt-min(dt)<7]),by="var1,var2,var3"]})
   user  system elapsed
   3.92    0.78    4.71
> class(ans3$dt) = "Date"
> mapply(identical,ans1,ans3)
var1 var2 var3   dt
TRUE TRUE TRUE TRUE

Note that in the documentaton ?"[.data.table" where I say that 'by' is slow, 
I mean relative to how fast it could be.  Its seems, in this specific 
example anyway, and with the code posted so far, to be significantly faster 
than sqldf and plyr.

"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message 
news:971536df1001191350x3bd5d982j9879e05453760062 at mail.gmail.com...
> Using data frame, a, from the post below this is how it would be done
> in SQL using sqldf.  We join together the original table, a,  with a
> table of minimums (computed by the nested select) and then choose only
> the rows where dt - mindt < 7 (in the where clause).
>
>> library(sqldf)
>> sqldf("select var1, var2, var3, dt from a, (select var1, var2, var3, 
>> min(dt) mindt from a group by var1, var2, var3) using(var1, var2, var3) 
>> where dt - mindt < 7")
>  var1 var2 var3         dt
> 1    s    1    2 1999-01-01
> 2    c    1    2 2000-02-10
> 3    c    1    2 2000-02-13
> 4    n    2    1 2000-02-11
> 5    n    2    1 2000-02-15
>
>
> On Tue, Jan 19, 2010 at 4:22 PM, hadley wickham <h.wickham at gmail.com> 
> wrote:
>> On Mon, Jan 18, 2010 at 1:54 PM, Bert Gunter <gunter.berton at gene.com> 
>> wrote:
>>> One way to do it:
>>>
>>> 1. Convert your date column to the Date class using the as.Date() 
>>> function.
>>> This allows you to do the necessary arithmetic on the dates below.
>>> dt <- as.Date(a[,4],"%d/%m/%Y")
>>>
>>> 2. Create a factor out of your first three columns whose levels are in 
>>> the
>>> same order as the unique rows. Something likes the following should do 
>>> it:
>>> fac <- do.call(paste,a[,-4])
>>> fac <- factor(fac, levels=unique(fac))
>>>
>>> This allows you to choose the groups of rows whose dates you wish to 
>>> compare
>>> and maintain their correct order in the data frame
>>>
>>> 3. Then use tapply:
>>> a[unlist(tapply(dt,fac,function(x)x-min(x) < 7)),]
>>>
>>> (unlist is needed to remove the list structure and concatenate the 
>>> logical
>>> indices to obtain the subscripting vector).
>>
>> Here's the same basic approach with the plyr package:
>>
>> a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = 
>> c("c",
>> "n", "s"), class = "factor"), var2 = c(1, 1, 1, 2, 2, 2), var3 = c(2,
>> 2, 2, 1, 1, 1), dt = structure(c(10592, 10997, 11000, 10998,
>> 11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3",
>> "dt"), row.names = c(NA, -6L), class = "data.frame")
>>
>> library(plyr)
>> ddply(a, c("var1", "var2", "var3"), subset, dt - min(dt) < 7)
>>
>> Hadley
>