[R] Improve code efficiency with do.call, rbind and split construction

Bert Gunter bgunter.4567 at gmail.com
Fri Sep 2 22:48:06 CEST 2016


Chuck:

I think this is quite clever. But note that the which() is
unnecessary: logical indexing suffices, e.g.

df[!duplicated(df[,c("f","g")],fromLast = TRUE),]
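
A minimal sketch of that equivalence, assuming a small data frame like
the one used in this thread (the names keep, z_logical and z_which are
only illustrative). Because duplicated() never returns NA, the logical
vector and which() of it should pick out the same rows:

```r
# Small illustrative data frame (same shape as the example in this thread).
set.seed(1001)
df <- data.frame(f = factor(sample(LETTERS[1:4], 100, replace = TRUE)),
                 g = factor(sample(letters[1:6], 100, replace = TRUE)),
                 y = runif(100))

# TRUE for the last row of each (f, g) combination; never NA.
keep <- !duplicated(df[, c("f", "g")], fromLast = TRUE)

z_logical <- df[keep, ]         # logical indexing
z_which   <- df[which(keep), ]  # integer indexing via which()

# Both forms should select the same rows in the same order.
all.equal(z_logical, z_which)
```

The two forms only differ when the logical vector contains NA, which
does not happen here.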

I thought that your approach would be faster because it moves the
comparisons out of tapply() and into C code. But I was wrong, e.g. for
1e6 rows:

> set.seed(1001)
> df <- data.frame(f = factor(sample(LETTERS[1:4],1e6,rep=TRUE)),
+                  g = factor(sample(letters[1:6],1e6,rep=TRUE)),
+                  y = runif(1e6))

##using duplicated()
 > system.time(z <-df[!duplicated(df[,c("f","g")],fromLast = TRUE),])
user  system elapsed
0.175   0.008   0.183

## Using tapply()
 > system.time(
    + {ix <- seq_len(nrow(df));
    + z <- df[with(df,tapply(ix,list(f,g),function(x)x[length(x)])),]
    + })
user  system elapsed
0.025   0.003   0.028
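
The timings are only comparable if both expressions pick the same rows.
A rough check, assuming a smaller data frame (1e5 rows, chosen here just
to keep it quick) and illustrative names rows_dup, rows_tap and
same_rows:

```r
set.seed(1001)
n <- 1e5  # assumed size, smaller than the 1e6 used for the timings
df <- data.frame(f = factor(sample(LETTERS[1:4], n, replace = TRUE)),
                 g = factor(sample(letters[1:6], n, replace = TRUE)),
                 y = runif(n))

# Row indices via duplicated(): in original row order.
rows_dup <- which(!duplicated(df[, c("f", "g")], fromLast = TRUE))

# Row indices via tapply(): a matrix laid out by factor level.
ix <- seq_len(nrow(df))
rows_tap <- with(df, tapply(ix, list(f, g), function(x) x[length(x)]))

# Sort the tapply() indices before comparing, since the layout differs.
same_rows <- all(sort(as.vector(rows_tap)) == rows_dup)
same_rows
```

Note that the two results contain the same rows but in a different
order: duplicated() keeps the original row order, while tapply() orders
by the levels of f within g.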


This illustrates the faultiness of my "intuition."  A guess would be
that the subscripting to get the factor combinations and the
duplicated.data.frame method take the extra time.
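
One way to probe that guess is to time the pieces separately. This is
only a sketch, assuming a 1e5-row data frame and illustrative names
(t_subset, t_dup, t_index, timings); the numbers will vary by machine
and R version, so none are shown here:

```r
set.seed(1001)
n <- 1e5  # assumed size; increase to 1e6 to match the timings above
df <- data.frame(f = factor(sample(LETTERS[1:4], n, replace = TRUE)),
                 g = factor(sample(letters[1:6], n, replace = TRUE)),
                 y = runif(n))

# Step 1: subscripting to get the two factor columns.
t_subset <- system.time(fg <- df[, c("f", "g")])[["elapsed"]]

# Step 2: the duplicated() data frame method on those columns.
t_dup <- system.time(d <- duplicated(fg, fromLast = TRUE))[["elapsed"]]

# Step 3: the final logical row subscripting.
t_index <- system.time(z <- df[!d, ])[["elapsed"]]

timings <- c(subset = t_subset, duplicated = t_dup, index = t_index)
print(timings)
```

A single run at this size is noisy, so repeating each step several
times would give a more trustworthy picture.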

Anyway...

Best,

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)


On Fri, Sep 2, 2016 at 11:50 AM, Charles C. Berry <ccberry at ucsd.edu> wrote:
> On Fri, 2 Sep 2016, Bert Gunter wrote:
> [snip]
>>
>>
>> The "trick" is to use tapply() to select the necessary row indices of
>> your data frame and forget about all the do.call and rbind stuff. e.g.
>>
>
> I agree the way to go is "select the necessary row indices" but I get there
> a different way. See below.
>
>>> set.seed(1001)
>>> df <- data.frame(f =factor(sample(LETTERS[1:4],100,rep=TRUE)),
>>
>> +                  g = factor(sample(letters[1:6],100,rep=TRUE)),
>> +                  y = runif(100))
>>>
>>>
>>> ix <- seq_len(nrow(df))
>>>
>>> ix <- with(df,tapply(ix,list(f,g),function(x)x[length(x)]))
>>> ix
>>
>>   a  b   c  d  e  f
>> A 94 69 100 59 80 87
>> B 89 57  65 90 75 88
>> C 85 92  86 95 97 62
>> D 47 73  72 74 99 96
>
>
>
>   jx <- which( !duplicated( df[,c("f","g")], fromLast=TRUE ))
>
>   xtabs(jx~f+g,df[jx,]) ## Show equivalence to Bert's `ix'
>
>    g
> f     a   b   c   d   e   f
>   A  94  69 100  59  80  87
>   B  89  57  65  90  75  88
>   C  85  92  86  95  97  62
>   D  47  73  72  74  99  96
>
>
> Chuck
>
>


