[R] operations between two aggregated data frames?

Gabor Grothendieck ggrothendieck at gmail.com
Sat May 15 00:44:24 CEST 2010


And here is a pure R solution:

> m <- merge(df1, df2, by = "category")
> m$datediff <- m$date.x - m$date.y
> m <- m[order(m$category, m$date.x, m$date.y), ]
> m
   category A.x     date.x A.y     date.y   datediff
2         1 124 2003-02-08  28 2003-05-17   -98 days
1         1 124 2003-02-08 116 2003-11-29  -294 days
6         1  22 2008-08-16  28 2003-05-17  1918 days
5         1  22 2008-08-16 116 2003-11-29  1722 days
4         1  96 2008-11-29  28 2003-05-17  2023 days
3         1  96 2008-11-29 116 2003-11-29  1827 days
10        2  18 2001-12-01  12 2005-02-26 -1183 days
9         2  18 2001-12-01   6 2008-10-25 -2520 days
8         2 150 2002-01-12  12 2005-02-26 -1141 days
7         2 150 2002-01-12   6 2008-10-25 -2478 days
14        3  24 2003-09-13 109 2005-10-01  -749 days
13        3  24 2003-09-13  92 2007-08-18 -1435 days
12        3 175 2009-08-01 109 2005-10-01  1400 days
11        3 175 2009-08-01  92 2007-08-18   714 days
24        4 126 2000-11-04  65 2000-11-18   -14 days
26        4 126 2000-11-04  91 2003-05-10  -917 days
25        4 126 2000-11-04  15 2003-07-26  -994 days
23        4 126 2000-11-04  54 2008-11-22 -2940 days
16        4  70 2004-03-13  65 2000-11-18  1211 days
18        4  70 2004-03-13  91 2003-05-10   308 days
17        4  70 2004-03-13  15 2003-07-26   231 days
15        4  70 2004-03-13  54 2008-11-22 -1715 days
20        4  64 2007-06-02  65 2000-11-18  2387 days
22        4  64 2007-06-02  91 2003-05-10  1484 days
21        4  64 2007-06-02  15 2003-07-26  1407 days
19        4  64 2007-06-02  54 2008-11-22  -539 days


On Fri, May 14, 2010 at 6:38 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> Generating df1 and df2 as in your post try this (and see
> http://sqldf.googlecode.com for more info):
>
>> library(sqldf)
>> out <- sqldf("select category,
> + df1.date date1,
> + df2.date date2,
> + df1.date - df2.date datediff
> + from df1 join df2 using(category)
> + order by category, date1, date2")
>>
>> out[[2]] <- as.Date(out[[2]], origin = "1970-01-01")
>> out[[3]] <- as.Date(out[[3]], origin = "1970-01-01")
>> out
>   category      date1      date2 datediff
> 1         1 2003-02-08 2003-05-17      -98
> 2         1 2003-02-08 2003-11-29     -294
> 3         1 2008-08-16 2003-05-17     1918
> 4         1 2008-08-16 2003-11-29     1722
> 5         1 2008-11-29 2003-05-17     2023
> 6         1 2008-11-29 2003-11-29     1827
> 7         2 2001-12-01 2005-02-26    -1183
> 8         2 2001-12-01 2008-10-25    -2520
> 9         2 2002-01-12 2005-02-26    -1141
> 10        2 2002-01-12 2008-10-25    -2478
> 11        3 2003-09-13 2005-10-01     -749
> 12        3 2003-09-13 2007-08-18    -1435
> 13        3 2009-08-01 2005-10-01     1400
> 14        3 2009-08-01 2007-08-18      714
> 15        4 2000-11-04 2000-11-18      -14
> 16        4 2000-11-04 2003-05-10     -917
> 17        4 2000-11-04 2003-07-26     -994
> 18        4 2000-11-04 2008-11-22    -2940
> 19        4 2004-03-13 2000-11-18     1211
> 20        4 2004-03-13 2003-05-10      308
> 21        4 2004-03-13 2003-07-26      231
> 22        4 2004-03-13 2008-11-22    -1715
> 23        4 2007-06-02 2000-11-18     2387
> 24        4 2007-06-02 2003-05-10     1484
> 25        4 2007-06-02 2003-07-26     1407
> 26        4 2007-06-02 2008-11-22     -539
>
> On Fri, May 14, 2010 at 6:13 PM, Jonathan <jonsleepy at gmail.com> wrote:
>> Hi All,
>>   I've come up with a solution for this problem that relies on a for loop,
>> and I was wondering if anybody had any insight into a more elegant method:
>>
>> I have two data frames, each has a column for categorical data and a column
>> for date.  What I'd like to do, ideally, is calculate the number of days
>> between all pairs of dates in data frame 1 and data frame 2 (*but only for
>> members of the same category*).  The number of members of each category
>> varies between the two data frames.
>>
>> For example:
>>
>>
>>> d <- seq(as.Date("2000-02-12"), as.Date("2009-08-18"), by="weeks")
>>
>>> df1 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE))
>>
>>> df2 <- data.frame('A'=sample(1:200,10), 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE))
>>
>>
>>> df1
>>     A       date category
>> 1   93 2004-02-28        3
>> 2  105 2001-03-17        3
>> 3  189 2009-07-04        2
>> 4  130 2003-07-05        2
>> 5  160 2005-09-24        2
>> 6   32 2004-11-06        2
>> 7  117 2007-03-17        1
>> 8  161 2003-07-19        4
>> 9  153 2001-09-15        3
>> 10 173 2005-08-27        1
>>
>>
>>> df2
>>     A       date category
>> 1  102 2006-08-19        3
>> 2   68 2004-11-27        2
>> 3  137 2003-01-11        1
>> 4   39 2002-12-28        2
>> 5  127 2004-03-06        4
>> 6  125 2002-02-23        2
>> 7  150 2002-05-18        4
>> 8   19 2003-02-22        1
>> 9   80 2000-08-05        1
>> 10  94 2003-12-27        1
>>
>>
>> Within a loop, I'd do the following (i is my counter; for the example,
>> I set it to 1):
>>
>>
>>> i<-1
>>
>> # Create the data frames:
>>
>>> yeari_1 <- df1[which(df1['category']==i),]; yeari_2 <- df2[which(df2['category']==i),]
>>
>> # Select only the data from category i
>>
>>> yeari_1
>>     A       date category
>> 7  117 2007-03-17        1
>> 10 173 2005-08-27        1
>>
>>> yeari_2
>>     A       date category
>> 3  137 2003-01-11        1
>> 8   19 2003-02-22        1
>> 9   80 2000-08-05        1
>> 10  94 2003-12-27        1
>>
>> # Convert dates to integers
>>
>> year1_i[[2]] <- as.integer(as.Date(yeari_1[[2]])); yeari_2[[2]] <-
>> as.integer(as.Date(yeari_2[[2]]));
>>
>>> yeari_1
>>     A  date category
>> 7  117 13589        1
>> 10 173 13022        1
>>> yeari_2
>>     A  date category
>> 3  137 12063        1
>> 8   19 12105        1
>> 9   80 11174        1
>> 10  94 12413        1
>>
>> # Get differences of all pairs:
>>
>>> result <- outer(yeari_1[[2]],yeari_2[[2]],'-')
>>> result
>>     [,1] [,2] [,3] [,4]
>> [1,] 1526 1484 2415 1176
>> [2,]  959  917 1848  609
>>
>> # Now, merge the results with the results from all the earlier
>> iterations for previous values of i, increment i to the next value,
>> and repeat.
>>
>>
>> ----
>>
>> Ideally, I could accomplish this in some sort of vectorized manner,
>> although the Force is not yet strong with me.  Any ideas would be
>> appreciated!
>>
>>
>> Regards,
>>
>> Jonathan
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>



More information about the R-help mailing list