[R] help: program efficiency

Romain Francois romain at r-enthusiasts.com
Fri Nov 26 21:13:53 CET 2010


Hello,

Can we really make the assumption that the data is sorted? The original
example was not:

> I am working on a function to make a duplicated value unique. For example,
> the original vector would be like : a = c(2,1,1,3,3,3,4)

If we can make that assumption, here is a C++-based version:


nodup_cpp_assumingsorted <- cxxfunction( signature( x_ = "numeric" ), '

     // since we modify x, we need to make a copy
     NumericVector x = clone<NumericVector>(x_);

     int n = x.size() ;
     double current, previous = x[0] ;
     int index = 0 ;  // position within the current run of equal values
     for( int i=1; i<n; i++){
         current = x[i] ;
         if( current == previous ){
             x[i] = current + (++index) / 100.0 ;
         } else {
             index = 0 ;
         }
         previous = current ;
     }
     return x ;
', plugin = "Rcpp" )
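For anyone who wants to poke at the loop outside of R, the same run-renumbering logic can be sketched as a standalone C++ function. The name nodup_sorted is mine, not part of the Rcpp snippet; this is a sketch under the same sorted-input assumption, not the original code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standalone sketch of the run-renumbering step: in a sorted vector,
// the k-th repeat within a run of equal values gets k/100 added; the
// first element of each run is left unchanged.
std::vector<double> nodup_sorted(std::vector<double> x) {
    if (x.empty()) return x;
    double previous = x[0];
    int index = 0;  // position within the current run
    for (std::size_t i = 1; i < x.size(); ++i) {
        double current = x[i];
        if (current == previous) {
            x[i] = current + (++index) / 100.0;
        } else {
            index = 0;  // a new run starts here
        }
        previous = current;
    }
    return x;
}
```

The pass is O(n) and touches each element exactly once, which is why it beats the vectorized R versions.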


with these results:

 > x <- sort( sample( 1:100000, size = 300000, replace = TRUE ) )

 > system.time( nodup3( x ) )
   user  system elapsed
  0.090   0.004   0.094
 > system.time( nodup3a( x ) )
   user  system elapsed
  0.036   0.005   0.040
 > system.time( nodup4( x ) )
   user  system elapsed
  0.025   0.004   0.029
 > system.time( nodup_cpp_assumingsorted( x ) )
   user  system elapsed
  0.003   0.001   0.004



Now, if we don't make the assumption that the data is sorted, here is
another C++-based version:

require( inline )
require( Rcpp )
nodup_cpp <- cxxfunction( signature( x_ = "numeric" ), '

     // since we modify x, we need to make a copy
     NumericVector x = clone<NumericVector>(x_);

     typedef std::map<double,int> imap ;
     typedef imap::value_type pair ;
     imap index ;
     int n = x.size() ;
     double current, previous = x[0] ;
     index.insert( pair( previous, 0 ) );

     imap::iterator it = index.begin() ;
     for( int i=1; i<n; i++){
         current = x[i] ;
         if( current == previous ){
             x[i] = current + ( ++(it->second) / 100.0 ) ;
         } else {
             it = index.find(current) ;
             if( it == index.end() ){
                 it = index.insert(
                     current > previous ? it : index.begin(),
                     pair( current, 0 )
                     ) ;
             } else {
                 x[i] = current + ( ++(it->second) / 100.0 ) ;
             }
         previous = current ;
         }
     }
     return x ;
', plugin = "Rcpp" )
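Again as a standalone sketch for experimenting outside R: the value-to-count map can be written more compactly with std::map's operator[], which inserts a zero count on first sight. The name nodup_unsorted is mine; the Rcpp version above instead caches an iterator, which saves one map lookup per element inside a run:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Standalone sketch of the map-based version: count[v] records how
// many duplicates of v have already been renumbered, so the output
// stays in the input's original order.
std::vector<double> nodup_unsorted(std::vector<double> x) {
    std::map<double, int> count;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double current = x[i];
        int& k = count[current];  // operator[] inserts 0 on first sight
        if (k > 0) x[i] = current + k / 100.0;
        ++k;
    }
    return x;
}
```

On c( 2, 1, 1, 2 ) this gives 2.00 1.00 1.01 2.01, matching nodup_cpp's order-preserving behavior shown below.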


which gives me this:

 > x <- sample( 1:100000, size = 300000, replace = TRUE )
 >
 > system.time( nodup_cpp( x ) )
   user  system elapsed
  0.111   0.002   0.113
 > system.time( nodup3( sort( x ) ) )
   user  system elapsed
  0.162   0.011   0.172
 > system.time( nodup3a( sort( x ) ) )
   user  system elapsed
  0.099   0.009   0.109
 > system.time( nodup4( sort( x ) ) )
   user  system elapsed
  0.089   0.004   0.094

so nodup4 is still faster, but its output is sorted rather than in the
input's original order:

 > x <- c( 2, 1, 1, 2 )
 > nodup4( sort( x ) )
[1] 1.00 1.01 2.00 2.01
 > nodup_cpp( x )
[1] 2.00 1.00 1.01 2.01

Romain

On 26/11/10 20:01, William Dunlap wrote:
>
>> -----Original Message-----
>> From: William Dunlap
>> Sent: Thursday, November 25, 2010 9:31 AM
>> To: 'randomcz'; r-help at r-project.org
>> Subject: RE: [R] help: program efficiency
>>
>> If the input vector t is known to be ordered
>> (or if you only care about runs of duplicated
>> values, not all duplicated values), the following
>> is pretty quick:
>>
>> nodup3<- function (t) {
>>      t + (sequence(rle(t)$lengths) - 1)/100
>> }
>>
>> If you don't know whether the input will be
>> ordered, then ave() will do it a bit faster
>> than your code:
>>
>> nodup2<- function (t) {
>>      ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
>> }
>>
>> E.g., for a sorted sequence of 300,000 numbers drawn with
>> replacement from 1:100,000 I get:
>>
>>> a2<- sort(sample(1:1e5, size=3e5, replace=TRUE))
>>> system.time(v<- nodup(a2))
>>     user  system elapsed
>>     2.78    0.05    3.97
>>> system.time(v2<- nodup2(a2))
>>     user  system elapsed
>>     1.83    0.02    2.66
>>> system.time(v3<- nodup3(a2))
>>     user  system elapsed
>>     0.18    0.00    0.14
>>> identical(v,v2)&&  identical(v,v3)
>> [1] TRUE
>>
>> If speed is truly an issue, the built-in sequence may
>> be replaced by a faster one that does the same thing:
>>
>> nodup3a<- function (t) {
>>      faster.sequence<- function(nvec) {
>>          seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])),
>>              nvec)
>>      }
>>      t + (faster.sequence(rle(t)$lengths) - 1)/100
>> }
>>
>> That took 0.05 seconds on the a2 dataset and produced
>> identical results.
>
> rle() computes a sort of second difference and
> nodup3a computes a cumsum on that second difference,
> to get back to a first difference.  The following
> avoids that wasted operation (along with rle's
> computation of the values component of its output).
>
> nodup4<- function(t) {
>      n<- length(t)
>      p<- c(0L, which(t[-1L] != t[-n]), n)
>      t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
> }
>
> That reduced nodup3a's time by about 30% on that dataset.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org
>>> [mailto:r-help-bounces at r-project.org] On Behalf Of randomcz
>>> Sent: Thursday, November 25, 2010 6:49 AM
>>> To: r-help at r-project.org
>>> Subject: [R] help: program efficiency
>>>
>>>
>>> hey guys,
>>>
>>> I am working on a function to make a duplicated value unique.
>>> For example,
>>> the original vector would be like : a = c(2,1,1,3,3,3,4)
>>> I'll like to transform it into:
>>> a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
>>> basically, find the duplicates and assign a unique value by
>>> adding a small
>>> amount and keep it in order.
>>> I come up with the following codes, but it runs slow if t is
>>> large. Is there
>>> a better way to do it?
>>> nodup = function(t)
>>> {
>>>    t.index=0
>>>    t.dup=duplicated(t)
>>>    for (i in 2:length(t))
>>>    {
>>>      if (t.dup[i]==T)
>>>        t.index=t.index+0.01
>>>      else t.index=0
>>>      t[i]=t[i]+t.index
>>>    }
>>>    return(t)
>>> }
>>>
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>


-- 
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/9VOd3l : ZAT! 2010
|- http://bit.ly/c6DzuX : Impressionnism with R
`- http://bit.ly/czHPM7 : Rcpp Google tech talk on youtube


