[R] problem of data manipulation
Matthew Dowle
mdowle at mdowle.plus.com
Wed Jan 20 14:48:31 CET 2010
The user wrote in their first post :
> I have a lot of observations in my dataset
Heres one way to do it with a data.table :
a=data.table(a)
ans = a[ , list(dt=dt[dt-min(dt)<7]) , by="var1,var2,var3"]
class(ans$dt) = "Date"
Timings are below comparing the 3 methods. In this example, data.table
appears to be 28 times faster than plyr, and 24 times faster than sqldf. I
excluded the one off time to build the key, since thats realistic, but even
including that time, data.table is still 16 times faster than plyr (134 /
(1.03+2.16+4.71)). With even more rows, it should be even bigger speedups.
> a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label =
> c("c",
"n", "s"), class = "factor"), var2 = c(1L, 1L, 1L, 2L, 2L, 2L), var3 = c(2L,
2L, 2L, 1L, 1L, 1L), dt = structure(c(10592, 10997, 11000, 10998,
11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3",
"dt"), row.names = c(NA, -6L), class = "data.frame")
>
> a = data.frame(lapply(a,function(x)rep(x,each=1000000)))
> dim(a)
[1] 6000000 4
> library(plyr)
> system.time({ans1 <<- ddply(a, c("var1", "var2", "var3"), subset, dt -
> min(dt) < 7)})
user system elapsed
131.39 3.11 134.80
> library(sqldf)
> system.time({ans2 <<- sqldf("select var1, var2, var3, dt from a, (select
> var1, var2, var3, min(dt) mindt from a group by var1, var2, var3)
> using(var1, var2, var3) where dt - mindt < 7")})
user system elapsed
110.26 2.24 113.32
> mapply(identical,ans1,ans2[order(ans2$var1),])
var1 var2 var3 dt
TRUE TRUE TRUE TRUE
>
> library(data.table)
> system.time({adt<<-data.table(a)})
user system elapsed
0.90 0.13 1.03
> system.time({setkey(adt,var1,var2,var3)})
user system elapsed
1.89 0.27 2.16
> system.time({ans3 <<-
> adt[,list(dt=dt[dt-min(dt)<7]),by="var1,var2,var3"]})
user system elapsed
3.92 0.78 4.71
> class(ans3$dt) = "Date"
> mapply(identical,ans1,ans3)
var1 var2 var3 dt
TRUE TRUE TRUE TRUE
Note that in the documentaton ?"[.data.table" where I say that 'by' is slow,
I mean relative to how fast it could be. Its seems, in this specific
example anyway, and with the code posted so far, to be significantly faster
than sqldf and plyr.
"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message
news:971536df1001191350x3bd5d982j9879e05453760062 at mail.gmail.com...
> Using data frame, a, from the post below this is how it would be done
> in SQL using sqldf. We join together the original table, a, with a
> table of minimums (computed by the nested select) and then choose only
> the rows where dt - mindt < 7 (in the where clause).
>
>> library(sqldf)
>> sqldf("select var1, var2, var3, dt from a, (select var1, var2, var3,
>> min(dt) mindt from a group by var1, var2, var3) using(var1, var2, var3)
>> where dt - mindt < 7")
> var1 var2 var3 dt
> 1 s 1 2 1999-01-01
> 2 c 1 2 2000-02-10
> 3 c 1 2 2000-02-13
> 4 n 2 1 2000-02-11
> 5 n 2 1 2000-02-15
>
>
> On Tue, Jan 19, 2010 at 4:22 PM, hadley wickham <h.wickham at gmail.com>
> wrote:
>> On Mon, Jan 18, 2010 at 1:54 PM, Bert Gunter <gunter.berton at gene.com>
>> wrote:
>>> One way to do it:
>>>
>>> 1. Convert your date column to the Date class using the as.Date()
>>> function.
>>> This allows you to do the necessary arithmetic on the dates below.
>>> dt <- as.Date(a[,4],"%d/%m/%Y")
>>>
>>> 2. Create a factor out of your first three columns whose levels are in
>>> the
>>> same order as the unique rows. Something likes the following should do
>>> it:
>>> fac <- do.call(paste,a[,-4])
>>> fac <- factor(fac, levels=unique(fac))
>>>
>>> This allows you to choose the groups of rows whose dates you wish to
>>> compare
>>> and maintain their correct order in the data frame
>>>
>>> 3. Then use tapply:
>>> a[unlist(tapply(dt,fac,function(x)x-min(x) < 7)),]
>>>
>>> (unlist is needed to remove the list structure and concatenate the
>>> logical
>>> indices to obtain the subscripting vector).
>>
>> Here's the same basic approach with the plyr package:
>>
>> a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label =
>> c("c",
>> "n", "s"), class = "factor"), var2 = c(1, 1, 1, 2, 2, 2), var3 = c(2,
>> 2, 2, 1, 1, 1), dt = structure(c(10592, 10997, 11000, 10998,
>> 11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3",
>> "dt"), row.names = c(NA, -6L), class = "data.frame")
>>
>> library(plyr)
>> ddply(a, c("var1", "var2", "var3"), subset, dt - min(dt) < 7)
>>
>> Hadley
>
More information about the R-help
mailing list