[R] help: program efficiency

William Dunlap wdunlap at tibco.com
Fri Nov 26 20:01:21 CET 2010


> -----Original Message-----
> From: William Dunlap 
> Sent: Thursday, November 25, 2010 9:31 AM
> To: 'randomcz'; r-help at r-project.org
> Subject: RE: [R] help: program efficiency
> 
> If the input vector t is known to be ordered
> (or if you only care about runs of duplicated
> values, not all duplicated values) the following
> is pretty quick
> 
> nodup3 <- function (t) { 
>     t + (sequence(rle(t)$lengths) - 1)/100
> }
> 
> If you don't know whether the input will be ordered
> then ave() will do it a bit faster than your
> code
> 
> nodup2 <- function (t) { 
>     ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
> }
> 
> E.g., for a sorted sequence of 300,000 numbers drawn with
> replacement from 1:100,000 I get:
> 
> > a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
> > system.time(v <- nodup(a2))
>    user  system elapsed 
>    2.78    0.05    3.97 
> > system.time(v2 <- nodup2(a2))
>    user  system elapsed 
>    1.83    0.02    2.66 
> > system.time(v3 <- nodup3(a2))
>    user  system elapsed 
>    0.18    0.00    0.14 
> > identical(v,v2) && identical(v,v3)
> [1] TRUE
> 
> If speed is truly an issue, the built-in sequence may
> be replaced by a faster one that does the same thing:
> 
> nodup3a <- function (t) {
>     faster.sequence <- function(nvec) {
>         seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), 
>             nvec)
>     }
>     t + (faster.sequence(rle(t)$lengths) - 1)/100
> }
> 
> That took 0.05 seconds on the a2 dataset and produced
> identical results.

rle() computes a sort of second difference, and
nodup3a then applies cumsum to that second difference
to get back to a first difference.  The following
avoids that wasted operation (along with rle's
computation of the values component of its output).

nodup4 <- function(t) {
    n <- length(t)
    p <- c(0L, which(t[-1L] != t[-n]), n)
    t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
}

That reduced nodup3a's time by about 30% on that dataset.
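
As a rough illustration of what nodup4 does internally, the body can
be unrolled on the small vector from the original question. The
intermediate names run_start, run_length and within_run are
hypothetical and do not appear in the function itself.

```r
t <- c(2, 1, 1, 3, 3, 3, 4)   # example vector from the original question
n <- length(t)

# Indices i where t[i + 1] differs from t[i], i.e. the last index of every
# run except the final one; 0 and n are added as end markers
p <- c(0L, which(t[-1L] != t[-n]), n)

run_start  <- p[-length(p)] + 1L   # first index of each run
run_length <- diff(p)              # number of elements in each run

# Offset within the run: own index minus the first index of the run
within_run <- seq_len(n) - rep.int(run_start, run_length)

result <- t + within_run / 100
print(result)
```

The first element of each run is left unchanged and later elements get
0.01, 0.02, ... added, which is what the nodup code in the original
question returns for this input (2, 1, 1.01, 3, 3.01, 3.02, 4).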
 
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  
 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org 
> > [mailto:r-help-bounces at r-project.org] On Behalf Of randomcz
> > Sent: Thursday, November 25, 2010 6:49 AM
> > To: r-help at r-project.org
> > Subject: [R] help: program efficiency
> > 
> > 
> > hey guys,
> > 
> > I am working on a function to make duplicated values unique.
> > For example, the original vector would be: a = c(2,1,1,3,3,3,4)
> > I'd like to transform it into:
> > a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
> > Basically, find the duplicates and assign each a unique value by
> > adding a small amount, keeping the original order.
> > I came up with the following code, but it runs slowly if t is
> > large. Is there a better way to do it?
> > nodup = function(t)
> > {
> >   t.index=0
> >   t.dup=duplicated(t)
> >   for (i in 2:length(t))
> >   {
> >     if (t.dup[i]==T)
> >       t.index=t.index+0.01
> >     else t.index=0
> >     t[i]=t[i]+t.index
> >   }
> >   return(t)
> > }
> > 
> > 
> > -- 
> > View this message in context: 
> > http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
> > Sent from the R help mailing list archive at Nabble.com.