[R] Removing rows with earlier dates

Wed Dec 29 16:24:17 CET 2010

On Dec 29, 2010, at 9:24 AM, Ali Salekfard wrote:

> Thanks to everyone. Joshua's response seemed the most concise one, but it
> used up so much memory that my R session just gave an error. I checked the
> other replies and, all in all, came up with this; I thought I would share
> it with others and get comments.
>
> My structure was as follows:
>
> ACCOUNT   RULE   DATE
> A1        xxxx   2010-01-01
> A2        xxxx   2007-05-01
> A2        xxxx   2007-05-01
> A2        xxxx   2005-05-01
> A2        xxxx   2005-05-01
> A1        xxxx   2009-01-01
>
> The most efficient solution I came across involves the following steps:
>
> 1. Find the latest date for each account, and convert it to a data frame:
>
> a<-tapply(my.mapping$DATE,my.mapping$ACCOUNT,max)
> a<-data.frame(ACCOUNT=names(a),DT=as.Date(a,"%Y-%m-%d"))
>
> 2. Merge the set with the original data:
>
> my.mapping<-merge(x=my.mapping,y=a,by.x="ACCOUNT",by.y="ACCOUNT")
>
> 3. Create a TAKE column, which confirms whether the date of the row is the
> maximum date for the account:
>
> my.mapping<-cbind(my.mapping,TAKE=my.mapping$DATE==my.mapping$DT)
>
> 4. Filter out all lines except those with TAKE==TRUE:
>
> my.mapping<-my.mapping[my.mapping$TAKE==TRUE,]
>
> The running time for my whole list was 4.5 sec, which is far better than
> any other way I tried. Let me have your thoughts on that.

My first thought is that you should use more spaces in your code. It
looks quite a bit more complex than the method I suggested. (My
benchmark says mine was maybe 50% faster, but with Maechler's
improvements it is now about 4 times faster.) I guess I shouldn't throw
too many stones about coding style.

my.mapping[ with(my.mapping, DATE == ave( DATE,
                                           ACCOUNT,
                                           FUN=max ) ), ]
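
As a minimal sketch of how that one-liner behaves, the following rebuilds a toy data frame from the sample quoted above and applies the same ave() filter. The RULE values are placeholders, and the comparison is done on the numeric day count (as.numeric(DATE)), which is an assumption made here to keep the example conservative rather than something the original code did.

```r
# Toy data reconstructed from the quoted post; RULE values are placeholders.
my.mapping <- data.frame(
    ACCOUNT = c("A1", "A2", "A2", "A2", "A2", "A1"),
    RULE    = "xxxx",
    DATE    = as.Date(c("2010-01-01", "2007-05-01", "2007-05-01",
                        "2005-05-01", "2005-05-01", "2009-01-01")),
    stringsAsFactors = FALSE
)

# ave() returns, for every row, the maximum day count within that row's ACCOUNT.
latest <- with(my.mapping, ave(as.numeric(DATE), ACCOUNT, FUN = max))

# Keep only the rows whose DATE equals the per-account maximum.
kept <- my.mapping[as.numeric(my.mapping$DATE) == latest, ]
print(kept)
```

Ties are kept: both A2 rows dated 2007-05-01 survive, alongside the single A1 row dated 2010-01-01.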
#------------------
require(rbenchmark)
# acc and dt are column names passed as strings, so index with [[ ]]
ave.method <- function(df, acc, dt) {
    df[df[[dt]] == ave(df[[dt]], df[[acc]], FUN = max), ]
}
merge.method <- function(df, acc, dt) {
    a  <- tapply(df[[dt]], df[[acc]], max)                # max per group
    a  <- data.frame(ACCOUNT = names(a), DT = as.vector(a))
    df <- merge(x = df, y = a, by.x = acc, by.y = "ACCOUNT")
    df$TAKE <- df[[dt]] == df$DT                          # flag the max rows
    df[df$TAKE, ]
}
benchmark(
    rep = ave.method(airquality, "Month", "Day"),
    pat = merge.method(airquality, "Month", "Day"),
    replications = 1000,
    order = c('replications', 'elapsed'))
#-----------------
  test replications elapsed relative user.self sys.self user.child sys.child
1  rep         1000   2.523 1.000000     2.512    0.018          0         0
2  pat         1000   7.847 3.110186     7.773    0.092          0         0

It does give the same answers when tested on airquality, though. That
says something for it, I suppose. (Had you offered a sensible test
dataset in your first posting, I would have offered a solution using
your column names, but as it was I figured you should have been able
to make the mappings.)
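
For what it is worth, the same-answers check can also be sketched with the column names from the quoted sample. This is only an illustration under assumptions: the toy data are reconstructed from the post, the RULE values are placeholders, the helper norm() is made up for the comparison, and dates are compared as numeric day counts so the result does not depend on how tapply() treats Date objects.

```r
# Toy data reconstructed from the quoted post; RULE values are placeholders.
my.mapping <- data.frame(
    ACCOUNT = c("A1", "A2", "A2", "A2", "A2", "A1"),
    RULE    = "xxxx",
    DATE    = as.Date(c("2010-01-01", "2007-05-01", "2007-05-01",
                        "2005-05-01", "2005-05-01", "2009-01-01")),
    stringsAsFactors = FALSE
)

# Merge route: per-account maximum day count, joined back onto the data.
mx <- tapply(as.numeric(my.mapping$DATE), my.mapping$ACCOUNT, max)
mx <- data.frame(ACCOUNT = names(mx), DT = as.vector(mx),
                 stringsAsFactors = FALSE)
merged    <- merge(my.mapping, mx, by = "ACCOUNT")
via.merge <- merged[as.numeric(merged$DATE) == merged$DT,
                    c("ACCOUNT", "RULE", "DATE")]

# ave() route: compare each row with its per-account maximum directly.
keep    <- with(my.mapping,
                as.numeric(DATE) == ave(as.numeric(DATE), ACCOUNT, FUN = max))
via.ave <- my.mapping[keep, ]

# merge() sorts by key, so order both results and reset row names first.
# norm() is a hypothetical helper used only for this comparison.
norm <- function(d) {
    d <- d[order(d$ACCOUNT, d$DATE), ]
    rownames(d) <- NULL
    d
}
same <- isTRUE(all.equal(norm(via.merge), norm(via.ave)))
print(same)
```

The two routes select the same rows; they differ only in row order and in the helper column (DT) that the merge route carries along until it is dropped.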

-- 
David.

>
> Ali

David Winsemius, MD
West Hartford, CT