[R] help: program efficiency

Romain Francois romain at r-enthusiasts.com
Fri Nov 26 21:22:08 CET 2010

```Le 26/11/10 21:13, Romain Francois a écrit :
>
> Hello,
>
> Can we really make the assumption that the data is sorted. The original
> example was not:
>
>> I am working on a function to make a duplicated value unique. For
>> example,
>> the original vector would be like : a = c(2,1,1,3,3,3,4)
>
> If we can make the assumption, here is a C++ based version:
>
>
> nodup_cpp_assumingsorted <- cxxfunction( signature( x_ = "numeric" ), '
>
> // since we modify x, we need to make a copy
> NumericVector x = clone<NumericVector>(x_);
>
> int n = x.size() ;
> double current, previous = x[0] ;
> int index ;
> for( int i=1; i<n; i++){
> current = x[i] ;
> if( current == previous ){
> x[i] = current + (++index) / 100.0 ;
> } else {
> index = 0 ;
> }
> previous = current ;
> }
> return x ;
> ', plugin = "Rcpp" )
>
>
> with these results:
>
>  > x <- sort( sample( 1:100000, size = 300000, replace = TRUE ) )
>
>  > system.time( nodup3( x ) )
> utilisateur système écoulé
> 0.090 0.004 0.094
>  > system.time( nodup3a( x ) )
> utilisateur système écoulé
> 0.036 0.005 0.040
>  > system.time( nodup4( x ) )
> utilisateur système écoulé
> 0.025 0.004 0.029
>  > system.time( nodup_cpp_assumingsorted( x) )
> utilisateur système écoulé
> 0.003 0.001 0.004
>
>
>
> Now, if we don't make the assumption that the data is sorted, here is
> another C++ based version:
>
> require( inline )
> require( Rcpp )
> nodup_cpp <- cxxfunction( signature( x_ = "numeric" ), '
>
> // since we modify x, we need to make a copy
> NumericVector x = clone<NumericVector>(x_);
>
> typedef std::map<double,int> imap ;
> typedef imap::value_type pair ;
> imap index ;
> int n = x.size() ;
> double current, previous = x[0] ;
> index.insert( pair( previous, 0 ) );
>
> imap::iterator it = index.begin() ;
> for( int i=1; i<n; i++){
> current = x[i] ;
> if( current == previous ){
> x[i] = current + ( ++(it->second) / 100.0 ) ;
> } else {
> it = index.find(current) ;
> if( it == index.end() ){
> it = index.insert(
> current > previous ? it : index.begin(),
> pair( current, 0 )
> ) ;
> } else {
> x[i] = current + ( ++(it->second) / 100.0 ) ;
> }
> previous = current ;
> }
> }
> return x ;
> ', plugin = "Rcpp" )
>
>
> which gives me this :
>
>  > x <- sample( 1:100000, size = 300000, replace = TRUE )
>  >
>  > system.time( nodup_cpp( x ) )
> utilisateur système écoulé
> 0.111 0.002 0.113
>  > system.time( nodup3( sort( x ) ) )
> utilisateur système écoulé
> 0.162 0.011 0.172
>  > system.time( nodup3a( sort( x ) ) )
> utilisateur système écoulé
> 0.099 0.009 0.109
>  > system.time( nodup4( sort( x ) ) )
> utilisateur système écoulé
> 0.089 0.004 0.094
>
> so nodup4 is still faster, but the values are not in the right order:
>
>  > x <- c( 2, 1, 1, 2 )
>  > nodup4( sort( x ) )
> [1] 1.00 1.01 2.00 2.01
>  > nodup_cpp( x )
> [1] 2.00 1.00 1.01 2.01
>
> Romain

I think this gives a more fair comparison :

> system.time( nodup_cpp( x ) )
utilisateur     système      écoulé
0.113       0.002       0.114
> system.time( { oo <- order(order(x)) ; nodup3( sort( x ) )[oo] } )
utilisateur     système      écoulé
0.336       0.012       0.347
> system.time( { oo <- order(order(x)) ; nodup3a( sort( x ) )[oo] } )
utilisateur     système      écoulé
0.251       0.011       0.262
> system.time( { oo <- order(order(x)) ; nodup4( sort( x ) )[oo] } )
utilisateur     système      écoulé
0.287       0.006       0.294

Romain

> Le 26/11/10 20:01, William Dunlap a écrit :
>>
>>> -----Original Message-----
>>> From: William Dunlap
>>> Sent: Thursday, November 25, 2010 9:31 AM
>>> To: 'randomcz'; r-help at r-project.org
>>> Subject: RE: [R] help: program efficiency
>>>
>>> If the input vector t is known to be ordered
>>> (or if you only care about runs of duplicated
>>> values, not all duplicated values) the following
>>> is pretty quick
>>>
>>> nodup3<- function (t) {
>>> t + (sequence(rle(t)\$lengths) - 1)/100
>>> }
>>>
>>> If you don't know if the the input will be ordered
>>> then ave() will do it a bit faster than your
>>> code
>>>
>>> nodup2<- function (t) {
>>> ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
>>> }
>>>
>>> E.g., for a sorted sequence of 300,000 numbers drawn with
>>> replacement from 1:100,000 I get:
>>>
>>>> a2<- sort(sample(1:1e5, size=3e5, replace=TRUE))
>>>> system.time(v<- nodup(a2))
>>> user system elapsed
>>> 2.78 0.05 3.97
>>>> system.time(v2<- nodup2(a2))
>>> user system elapsed
>>> 1.83 0.02 2.66
>>>> system.time(v3<- nodup3(a2))
>>> user system elapsed
>>> 0.18 0.00 0.14
>>>> identical(v,v2)&& identical(v,v3)
>>> [1] TRUE
>>>
>>> If speed is truly an issue, the built-in sequence may
>>> be replaced by a faster one that does the same thing:
>>>
>>> nodup3a<- function (t) {
>>> faster.sequence<- function(nvec) {
>>> seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])),
>>> nvec)
>>> }
>>> t + (faster.sequence(rle(t)\$lengths) - 1)/100
>>> }
>>>
>>> That took 0.05 seconds on the a2 dataset and produced
>>> identical results.
>>
>> rle() computes a sort of second difference and
>> nodup3a computes a cumsum on that second diffence,
>> to get back to a first difference. The following
>> avoids that wasted operation (along with rle's
>> computation of the values component of its output).
>>
>> nodup4<- function(t) {
>> n<- length(t)
>> p<- c(0L, which(t[-1L] != t[-n]), n)
>> t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
>> }
>>
>> That reduced nodup3a's time by about 30% on that dataset.
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org
>>>> [mailto:r-help-bounces at r-project.org] On Behalf Of randomcz
>>>> Sent: Thursday, November 25, 2010 6:49 AM
>>>> To: r-help at r-project.org
>>>> Subject: [R] help: program efficiency
>>>>
>>>>
>>>> hey guys,
>>>>
>>>> I am working on a function to make a duplicated value unique.
>>>> For example,
>>>> the original vector would be like : a = c(2,1,1,3,3,3,4)
>>>> I'll like to transform it into:
>>>> a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
>>>> basically, find the duplicates and assign a unique value by
>>>> amount and keep it in order.
>>>> I come up with the following codes, but it runs slow if t is
>>>> large. Is there
>>>> a better way to do it?
>>>> nodup = function(t)
>>>> {
>>>> t.index=0
>>>> t.dup=duplicated(t)
>>>> for (i in 2:length(t))
>>>> {
>>>> if (t.dup[i]==T)
>>>> t.index=t.index+0.01
>>>> else t.index=0
>>>> t[i]=t[i]+t.index
>>>> }
>>>> return(t)
>>>> }
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://r.789695.n4.nabble.com/help-program-efficiency-tp305907
>>> 9p3059079.html
>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>

--
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/9VOd3l : ZAT! 2010
|- http://bit.ly/c6DzuX : Impressionnism with R