[R] How should I improve the following R code?
Charles C. Berry
cberry at tajo.ucsd.edu
Tue Jan 8 01:57:23 CET 2008
On Mon, 7 Jan 2008, Seung Jun wrote:
> I'm looking for a way to improve code that's proven to be inefficient.
>
Jim was probably right on both counts (use Rprof and expect wtd.quantile
to be the place where the time is being used).
If following his advice doesn't get you what you need, try vectorizing the
whole lot by stacking the 'index'es and the 'count's. To see how to do
this look at these plots:
> library(Hmisc)
> index <- 1:4
> count <- index*10
> plot(wtd.quantile( index, count, seq(0,1,by=0.001)))
> plot(rep(index,count))
>
and now this one where I 'stack' another table on top of the first one:
index.2 <- c(1,3)
count.2 <- c(30,40)
plot( rep( c( index, index.2 ), c ( count, count.2 ) ) )
As you can probably see, for your case wtd.quantile() is in effect
doing a lookup, interpolating between points in those cases in which an
interpolation is needed.
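
A minimal sketch of that lookup for a single table, in base R only. It
reproduces quantile() with its default type 7 on rep(index, count) without
building the expanded vector. The function name lookup_quantile is
hypothetical, and the code assumes 'index' is sorted ascending and 'count'
holds non-negative integers. Whether wtd.quantile() agrees exactly for
every input is an assumption worth checking on your own data.

```r
# Quantiles of rep(index, count) by lookup in the cumulative counts.
# Assumes 'index' is sorted ascending and 'count' holds non-negative integers.
lookup_quantile <- function(index, count, probs) {
  cum <- cumsum(count)          # position of the last copy of each index value
  n   <- cum[length(cum)]       # total number of observations
  h   <- (n - 1) * probs + 1    # fractional position, as in quantile() type 7
  lo  <- floor(h)
  hi  <- ceiling(h)
  # k-th order statistic: the first index whose cumulative count reaches k
  value_at <- function(k) index[findInterval(k - 1, cum) + 1]
  # linear interpolation between the two neighbouring order statistics
  value_at(lo) + (h - lo) * (value_at(hi) - value_at(lo))
}

index <- 1:4
count <- index * 10
probs <- c(0, 0.2, 0.5, 0.8, 1)
res <- lookup_quantile(index, count, probs)
```

Comparing 'res' with quantile(rep(index, count), probs) is a quick way to
convince yourself that the lookup and the expanded vector agree.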
The challenge for you is to figure out how to do the lookup without
resorting to approx() - which is used by wtd.quantile(). Keeping track of
the cumulative number of the stacked counts with cumsum(), the number in
each table, and the cumulative number of counts for all previous
tables should get you there.
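
One way the stacked bookkeeping might look, as a sketch rather than a
finished solution. The function name stacked_quantiles is hypothetical;
'index' and 'count' are assumed to be lists with one numeric vector per
table, each 'index' vector sorted ascending. The result follows
quantile() type 7 on the expanded data, which is assumed (not verified
here) to be what you want in place of wtd.quantile().

```r
# Quantiles for many tables at once, with no loop over tables.
# Rows of the result are tables, columns are the requested probabilities.
stacked_quantiles <- function(index, count, probs) {
  idx    <- unlist(index)              # all tables stacked end to end
  cum    <- cumsum(unlist(count))      # cumulative number of stacked counts
  n_tab  <- vapply(count, function(x) as.numeric(sum(x)), numeric(1))
  offset <- cumsum(n_tab) - n_tab      # counts in all previous tables
  h  <- outer(n_tab - 1, probs) + 1    # fractional position within each table
  lo <- floor(h)
  hi <- ceiling(h)
  # look up the k-th order statistic of each table in the stacked data;
  # 'offset' recycles down the columns, so row i is shifted by offset[i]
  value_at <- function(k) {
    v <- idx[findInterval(k + offset - 1, cum) + 1]
    matrix(v, nrow = nrow(k))
  }
  value_at(lo) + (h - lo) * (value_at(hi) - value_at(lo))
}

# The two tables from the question
index <- list(c(0, 1, 7, 30), c(0, 2, 3, 19))
count <- list(c(234, 120, 11, 1), c(199, 110, 87, 9))
probs <- c(0, 0.2, 0.5, 0.8, 1)
q <- stacked_quantiles(index, count, probs)
```

The columns of 'q' can then be assigned to the data frame in one step each,
instead of element by element inside a for-loop.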
HTH,
Chuck
> Suppose that a data source generates the following table every minute:
>
> Index  Count
> ------------
>     0    234
>     1    120
>     7     11
>    30      1
>
> I save the tables in the following CSV format:
>
> time,index,count
> 0,0:1:7:30,234:120:11:1
> 1,0:2:3:19,199:110:87:9
>
> That is, each line represents a table, and I have N lines for N minutes of
> data collection.
>
> Now, I wrote the following code to get quantiles for each time period:
>
> library(Hmisc)
> stbl <- read.csv("data.csv", stringsAsFactors = FALSE)
> index <- lapply(strsplit(stbl$index, ":", fixed = TRUE), as.numeric)
> count <- lapply(strsplit(stbl$count, ":", fixed = TRUE), as.numeric)
> len <- length(index)
> for (i in 1:len) {
>   v <- wtd.quantile(index[[i]], count[[i]], c(0, 0.2, 0.5, 0.8, 1))
>   stbl$q0[i] <- v[1]
>   stbl$q2[i] <- v[2]
>   stbl$q5[i] <- v[3]
>   stbl$q8[i] <- v[4]
>   stbl$q10[i] <- v[5]
> }
>
> It works fine for a small N, but it quickly gets inefficient as N grows. The
> for-loop takes too long. How could I improve the code or data
> representation so it can run fast?
>
> Thanks,
> Seung
>
Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901