[R] How should I improve the following R code?

Tue Jan 8 01:57:23 CET 2008

On Mon, 7 Jan 2008, Seung Jun wrote:

> I'm looking for a way to improve code that's proven to be inefficient.
>

Jim was probably right on both counts (use Rprof and expect wtd.quantile 
to be the place where the time is being used).

If following his advice doesn't get you what you need, try vectorizing the 
whole lot by stacking the 'index'es and the 'count's. To see how to do 
this look at these plots:

> library(Hmisc)
> index <- 1:4
> count <- index*10
> plot(wtd.quantile( index, count, seq(0,1,by=0.001)))
> plot(rep(index,count))

and now this one where I 'stack' another table on top of the first one:

index.2 <- c(1,3)
count.2 <- c(30,40)

plot( rep( c( index, index.2 ), c ( count, count.2 ) ) )

As you can probably see, in your case wtd.quantile() is in effect doing a 
lookup, plus an interpolation between points in those cases in which an 
interpolation is needed.

The challenge for you is to figure out how to do the lookup without 
resorting to approx() - which is used by wtd.quantile(). Keeping track of 
the cumulative number of the stacked counts with cumsum(), the number in 
each table, and the cumulative number of counts for all previous tables 
should get you there.
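
The bookkeeping described above can be sketched in base R as follows. This
is only a sketch under stated assumptions: the function name
stacked_quantiles() and everything inside it are illustrative and not part
of Hmisc, the counts are assumed to be positive integers, and the target is
to reproduce quantile(rep(index, count), probs) (the default type 7
definition) for each table, which is what the plots suggest wtd.quantile()
amounts to in this case. The lookup uses findInterval() rather than
approx(); its left.open argument needs R >= 3.3.0.

```r
# Hypothetical helper: quantiles for many (index, count) tables at once.
# 'index' and 'count' are lists of numeric vectors, one element per table.
stacked_quantiles <- function(index, count, probs = c(0, 0.2, 0.5, 0.8, 1)) {
  sizes <- lengths(index)                  # rows in each table
  tab   <- rep(seq_along(index), sizes)    # table id of each stacked row
  x     <- unlist(index)
  w     <- unlist(count)
  o     <- order(tab, x)                   # sort by index within each table
  x     <- x[o]
  w     <- w[o]

  cum   <- cumsum(w)                       # running count over the whole stack
  last  <- cumsum(sizes)                   # last stacked row of each table
  prev  <- c(0, head(cum[last], -1))       # counts in all previous tables
  n     <- cum[last] - prev                # counts in each table

  # Position of each quantile inside its own table's expanded data,
  # i.e. inside rep(index[[i]], count[[i]]); one row per table.
  h     <- 1 + outer(n - 1, probs)
  lo    <- floor(h)
  hi    <- ceiling(h)
  frac  <- h - lo                          # interpolation weight

  # Shift a within-table position by 'prev' to get a position in the
  # stack, then find the stacked row whose cumulative count covers it.
  value_at <- function(pos) {
    x[findInterval(pos + prev, cum, left.open = TRUE) + 1]
  }

  q <- (1 - frac) * value_at(lo) + frac * value_at(hi)
  dimnames(q) <- list(NULL, paste0("p", probs))
  q
}

# The two tables from the quoted CSV example
index <- list(c(0, 1, 7, 30), c(0, 2, 3, 19))
count <- list(c(234, 120, 11, 1), c(199, 110, 87, 9))
q <- stacked_quantiles(index, count)
```

The result has one row per table and one column per probability, so the
columns can be assigned to the data frame in one step instead of inside a
for-loop. It is worth checking a few rows against wtd.quantile() on your
own data before relying on it.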

HTH,

Chuck

> Suppose that a data source generates the following table every minute:
>
>  Index  Count
>  ------------
>  0      234
>  1      120
>  7      11
>  30     1
>
> I save the tables in the following CSV format:
>
>  time,index,count
>  0,0:1:7:30,234:120:11:1
>  1,0:2:3:19,199:110:87:9
>
> That is, each line represents a table, and I have N lines for N minutes of
> data collection.
>
> Now, I wrote the following code to get quantiles for each time period:
>
>  library(Hmisc)
>  stbl  <- read.csv("data.csv")
>  index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
>  count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)
>  len   <- length(index)
>  for (i in 1:len) {
>    v <- wtd.quantile(index[[i]], count[[i]], c(0, 0.2, 0.5, 0.8, 1))
>    stbl$q0[i] <- v[1]
>    stbl$q2[i] <- v[2]
>    stbl$q5[i] <- v[3]
>    stbl$q8[i] <- v[4]
>    stbl$q10[i] <- v[5]
>  }
>
> It works fine for a small N, but it quickly gets inefficient as N grows.  The
> for-loop takes too long.  How could I improve the code or data
> representation so it can run fast?
>
> Thanks,
> Seung
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901