[R] help: program efficiency
William Dunlap
wdunlap at tibco.com
Fri Nov 26 20:01:21 CET 2010
> -----Original Message-----
> From: William Dunlap
> Sent: Thursday, November 25, 2010 9:31 AM
> To: 'randomcz'; r-help at r-project.org
> Subject: RE: [R] help: program efficiency
>
> If the input vector t is known to be ordered
> (or if you only care about runs of duplicated
> values, not all duplicated values) the following
> is pretty quick
>
> nodup3 <- function (t) {
>     t + (sequence(rle(t)$lengths) - 1)/100
> }
>
> If you don't know whether the input will be ordered,
> then ave() will do it a bit faster than your
> code
>
> nodup2 <- function (t) {
>     ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
> }
>
> E.g., for a sorted sequence of 300,000 numbers drawn with
> replacement from 1:100,000 I get:
>
> > a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
> > system.time(v <- nodup(a2))
>    user  system elapsed
>    2.78    0.05    3.97
> > system.time(v2 <- nodup2(a2))
>    user  system elapsed
>    1.83    0.02    2.66
> > system.time(v3 <- nodup3(a2))
>    user  system elapsed
>    0.18    0.00    0.14
> > identical(v,v2) && identical(v,v3)
> [1] TRUE
>
> If speed is truly an issue, the built-in sequence may
> be replaced by a faster one that does the same thing:
>
> nodup3a <- function (t) {
>     faster.sequence <- function(nvec) {
>         seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])),
>                                  nvec)
>     }
>     t + (faster.sequence(rle(t)$lengths) - 1)/100
> }
>
> That took 0.05 seconds on the a2 dataset and produced
> identical results.

rle() computes a sort of second difference and
nodup3a computes a cumsum on that second difference,
to get back to a first difference. The following
avoids that wasted operation (along with rle's
computation of the values component of its output).

nodup4 <- function(t) {
    n <- length(t)
    p <- c(0L, which(t[-1L] != t[-n]), n)
    t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
}

That reduced nodup3a's time by about 30% on that dataset.
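
As a rough walk-through of what nodup4 does internally, the sketch below spells out its intermediate vectors on a short sorted input. The vector x and the names run_start and offset are made up for illustration, and the input is assumed to contain no NA values.

```r
nodup4 <- function(t) {
    n <- length(t)
    p <- c(0L, which(t[-1L] != t[-n]), n)
    t + (seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p))) / 100
}

x <- c(1, 1, 3, 3, 3, 4)   # illustrative sorted input
n <- length(x)

# Index of the last element of each run, with 0 prepended: 0 2 5 6
p <- c(0L, which(x[-1L] != x[-n]), n)

# Start index of each run, repeated once per element of that run: 1 1 3 3 3 6
run_start <- rep.int(p[-length(p)] + 1L, diff(p))

# Zero-based position of each element within its run: 0 1 0 1 2 0
offset <- seq_len(n) - run_start

result <- nodup4(x)
print(result)
```

Here diff(p) plays the role of rle(t)$lengths, so the run lengths are never turned into a cumulative sum and back again.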
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of randomcz
> > Sent: Thursday, November 25, 2010 6:49 AM
> > To: r-help at r-project.org
> > Subject: [R] help: program efficiency
> >
> >
> > hey guys,
> >
> > I am working on a function to make duplicated values unique.
> > For example, the original vector would be like: a = c(2,1,1,3,3,3,4)
> > I'd like to transform it into:
> > a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
> > Basically, find the duplicates, assign each a unique value by
> > adding a small amount, and keep the order.
> > I came up with the following code, but it runs slowly if t is
> > large. Is there a better way to do it?
> > nodup = function(t)
> > {
> >     t.index=0
> >     t.dup=duplicated(t)
> >     for (i in 2:length(t))
> >     {
> >         if (t.dup[i]==T)
> >             t.index=t.index+0.01
> >         else t.index=0
> >         t[i]=t[i]+t.index
> >     }
> >     return(t)
> > }
> >
> >
> > --
> > View this message in context:
> > http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
> > Sent from the R help mailing list archive at Nabble.com.
> >