[R] problem of data manipulation

Gabor Grothendieck ggrothendieck at gmail.com
Tue Jan 19 22:50:43 CET 2010


Using data frame, a, from the post below this is how it would be done
in SQL using sqldf.  We join together the original table, a,  with a
table of minimums (computed by the nested select) and then choose only
the rows where dt - mindt < 7 (in the where clause).

> library(sqldf)
> sqldf("select var1, var2, var3, dt from a, (select var1, var2, var3, min(dt) mindt from a group by var1, var2, var3) using(var1, var2, var3) where dt - mindt < 7")
  var1 var2 var3         dt
1    s    1    2 1999-01-01
2    c    1    2 2000-02-10
3    c    1    2 2000-02-13
4    n    2    1 2000-02-11
5    n    2    1 2000-02-15


On Tue, Jan 19, 2010 at 4:22 PM, hadley wickham <h.wickham at gmail.com> wrote:
> On Mon, Jan 18, 2010 at 1:54 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>> One way to do it:
>>
>> 1. Convert your date column to the Date class using the as.Date() function.
>> This allows you to do the necessary arithmetic on the dates below.
>> dt <- as.Date(a[,4],"%d/%m/%Y")
>>
>> 2. Create a factor out of your first three columns whose levels are in the
>> same order as the unique rows. Something likes the following should do it:
>> fac <- do.call(paste,a[,-4])
>> fac <- factor(fac, levels=unique(fac))
>>
>> This allows you to choose the groups of rows whose dates you wish to compare
>> and maintain their correct order in the data frame
>>
>> 3. Then use tapply:
>> a[unlist(tapply(dt,fac,function(x)x-min(x) < 7)),]
>>
>> (unlist is needed to remove the list structure and concatenate the logical
>> indices to obtain the subscripting vector).
>
> Here's the same basic approach with the plyr package:
>
> a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = c("c",
> "n", "s"), class = "factor"), var2 = c(1, 1, 1, 2, 2, 2), var3 = c(2,
> 2, 2, 1, 1, 1), dt = structure(c(10592, 10997, 11000, 10998,
> 11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3",
> "dt"), row.names = c(NA, -6L), class = "data.frame")
>
> library(plyr)
> ddply(a, c("var1", "var2", "var3"), subset, dt - min(dt) < 7)
>
> Hadley



More information about the R-help mailing list