[R] Competing with SPSS and SAS: improving code that loops through rows (data manipulation)

Dimitri Liakhovitski ld7631 at gmail.com
Sat Mar 27 13:43:50 CET 2010


Dear all, thank you so much for your advice, and special thanks to
you, Jim, for digging into my code (which was too long).
I'll dig into yours now. It definitely looks very fast, and it's a
lot of great learning for me because, as you can see, I am just a
rudimentary programmer.
Thank you very, very much!
Dimitri

On Fri, Mar 26, 2010 at 7:28 PM, Jim Price <price_ja at hotmail.com> wrote:
>
> Here's my first stab. It removes some of the typical redundancies in your
> code (loops, building data frames by adding one column at a time) and
> instead does what is probably more canonical R style (although I'm willing
> to be corrected, as I suspect my code is a little suspect at times).
>
> For this example, I got a 10-fold speed-up, although I suspect this code
> will scale a lot better - primarily because I'm not continually expanding
> the data frames one column at a time, but instead working each part out
> separately and then sticking them together at the end. The key commands used
> (for when you look through the help files) are lapply, do.call, by and
> Reduce.
>
> If you use this scaled up you'd need to play with some of the indices in
> places, but I'm sure that's all pretty obvious.
>
> Oh, and because this is the usual (and good!) advice - don't call your data
> 'data':
>
> library(fortunes)
> fortune('dog')
>
>
>
> # This was your base set-up code
> set.seed(123)
> data <- data.frame(group = c(rep("first", 10), rep("second", 10)),
>                    week = c(1:10, 1:10),
>                    a = abs(round(rnorm(20) * 10, 0)),
>                    b = abs(round(rnorm(20) * 100, 0)))
> data
>
>
> # Set up the ratio variables
> system.time({
> temp <- cbind(data, do.call(cbind, lapply(names(data)[3:4], function(.x)
>        {
>                unlist(by(data, data$group, function(.y) .y[,.x] / max(.y[,.x])))
>        })))
> colnames(temp)[5:6] <- paste(colnames(data)[3:4], 'ind.to.max', sep = '.')
> })
>
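A minimal sketch of the same pattern on toy data, in case it helps when reading the block above. The data frame `df`, the vector `vars` and the use of ave() are assumptions for illustration and are not part of Jim's code. Each new column is computed separately with lapply and everything is bound on in one step with do.call(cbind, ...). ave() is swapped in for by() here because it returns values in the original row order, whereas unlist(by(...)) returns them in group order and so relies on the rows already being sorted by group.

```r
# Hypothetical toy data, not the data from this thread
df <- data.frame(group = rep(c("first", "second"), each = 3),
                 a = c(2, 4, 8, 1, 5, 10),
                 b = c(10, 20, 40, 3, 6, 12))

vars <- c("a", "b")

# One ratio-to-group-maximum vector per variable, each worked out on its own
ratio_cols <- lapply(vars, function(v) {
  ave(df[[v]], df$group, FUN = function(x) x / max(x))
})
names(ratio_cols) <- paste(vars, "ind.to.max", sep = ".")

# Bind all new columns on in a single step instead of growing df in a loop
out <- cbind(df, do.call(cbind, ratio_cols))
out
```

The list names become the column names after do.call(cbind, ...), so no separate colnames() assignment is needed in this version.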
>
>
>
>
> system.time({
> constants <- expand.grid(vars = colnames(temp)[5:6], c1 = 1:3,
>                          c2 = seq(0.15, 0.45, 0.15))
>
>
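> # One transformed column per row of 'constants' (variable, c1, c2)
> results <- lapply(seq(nrow(constants)), function(.x)
>        {
>                dat <- temp[, as.character(constants[.x, 1])]
>                # Note: ^ binds tighter than /, so this is (exp(1)^log(0.5)) / c1, i.e. 0.5 / c1
>                d <- exp(1) ^ log(0.5) / constants[.x, 2]
>                l <- -10 * log(1 - constants[.x, 3])
>
>                # Within each group, carry the previous result (.u) forward to the next row (.v)
>                unlist(by(dat, temp$group, function(.y)
>                        Reduce(function(.u, .v) 1 - ((1 - .u * d) / (exp(1) ^ (.v * l))),
>                               .y, accumulate = TRUE, init = 0)[-1]))
>        })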
>
> final <- cbind(temp, do.call(cbind, results))
> colnames(final)[-(1:6)] <- paste(substr(constants$vars, 1, 1), constants$c1,
> 100*constants$c2, '..transf', sep = '.')
> })
>
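To see what the Reduce call above is doing, here is a small sketch that runs the same kind of recurrence both with Reduce(..., accumulate = TRUE, init = 0) and with an explicit loop over the rows. The values of `d`, `l` and `x` are made up for illustration, and `step` is a hypothetical helper name; exp(cur * l) is used in place of exp(1) ^ (cur * l), which is the same quantity.

```r
# Hypothetical constants and input, chosen only for illustration
d <- 0.5
l <- 2
x <- c(0.2, 0.5, 1.0)

# One step of the recurrence: 'prev' is the previous result, 'cur' the current value
step <- function(prev, cur) 1 - (1 - prev * d) / exp(cur * l)

# accumulate = TRUE keeps every intermediate result; [-1] drops the starting value 0
via_reduce <- Reduce(step, x, accumulate = TRUE, init = 0)[-1]

# The same computation as an explicit row-by-row loop
via_loop <- numeric(length(x))
prev <- 0
for (i in seq_along(x)) {
  prev <- step(prev, x[i])
  via_loop[i] <- prev
}

stopifnot(isTRUE(all.equal(via_reduce, via_loop)))
```

Because each result depends on the one before it, this step cannot be vectorised in the usual way, which is presumably why Reduce is used here rather than a plain arithmetic expression on the whole column.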
>
>
>
>
> Jim Price.
> Cardiome Pharma Corp.
>
>
>



-- 
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com


