[R] Competing with SPSS and SAS: improving code that loops throughrows (data manipulation)

Tue Mar 30 15:32:49 CEST 2010

Below is the script that was based on your help - it works very fast:

### Creating the example data set:
set.seed(123)
MyData<-data.frame(group=c(rep("first",10),rep("second",10)),a=abs(round(rnorm(20,mean=0,
sd=.55),2)), b=abs(round(rnorm(20,mean=0, sd=.55),2)))
MyData

### Specifying parameters used in the code below:
vars<-names(MyData)[2:3] # names of variables to be transformed
nr.vars<-length(vars) # number of variables to be transformed
group.var<-names(MyData)[1] # name of the grouping variable

### For EACH subgroup: indexing variables a and b to their maximum in
that subgroup;
### These indexed variables will be used to build the new ones:
system.time({
temp <- cbind(MyData, do.call(cbind, lapply(vars, function(x){    #x<-"b"
  unlist(by(MyData, MyData[[group.var]], function(y) y[,x] / max(y[,x])))
})))
colnames(temp)[(length(MyData)+1):(length(MyData)+nr.vars)] <-
paste(vars, 'IndToMax', sep = '.')
})

# Grabbing names of the newly created variables that end with "IndToMax"
indexed.vars<-names(temp)[grep("IndToMax$", names(temp))] # variables
indexed to subgroup max

# Specifying parameters used for transformation below:
old.length<-length(temp)
hl<-c(.3,.6,1:5)
hrf<-seq(.15,.90,.15)

### Actual Transformation:
library(fortunes) # will use function Reduce from the package "fortunes"
system.time({
constants <- expand.grid(vars = indexed.vars, HL = hl, HRF = hrf)
results <- lapply(seq(nrow(constants)), function(x){
  dat <- temp[, as.character(constants[x, 1])]
  D <- exp(log(0.5) / constants[x, 2])
  L <- -10 * log(1 - constants[x, 3])
  unlist(by(dat, temp[[group.var]], function(y)  # function Reduce
    Reduce(function(u, v) 1 - ((1 - u * D) / (exp(v * L))), y,
accumulate = T, init = 0)[-1]))
  })
final <- cbind(temp, do.call(cbind, results))
colnames(final)[-(1:old.length)] <- paste(vars, constants$HL,
100*constants$HRF, '.transformed', sep = '.')
})

Thanks again for all your help!
Dimitri

On Mon, Mar 29, 2010 at 4:16 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
> Would like to thank every one once more for your great help.
> I was able to reduce the time from god knows how many hours to about 2 minutes!
> Really appreciate it!
> Dimitri
>
> On Sat, Mar 27, 2010 at 11:43 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 03/26/2010 06:40 PM, Dimitri Liakhovitski wrote:
>>> My sincere apologies if it looked large. Let me try again with less code.
>>> It's hard to do less than that. In fact - there is nothing in this
>>> code but 1 formula and many loops, which is the problem I am not sure
>>> how to solve.
>>> I also tried to be as clear as possible with the comments.
>>> Dimitri
>>>
>>> ## START OF THE CODE TO PRODUCE SMALL DATA EXAMPLE
>>>   set.seed(123)
>>>   data<-data.frame(group=c(rep("first",10),rep("second",10)),a=abs(round(rnorm(20,mean=0,
>>> sd=.55),2)), b=abs(round(rnorm(20,mean=0, sd=.55),2)))
>>>   data                   # "data" it is the data frame to work with
>>> ## END OF THE CODE TO PRODUCE SMALL DATA EXAMPLE. In real life "data"
>>> would contain up to 150-200 rows PER SUBGROUP
>>>
>>> ### Specifying useful parameters used in the slow code below:
>>> vars<-names(data)[2:3]                    # names of variables used in
>>> transformation; in real life - up to 50-60 variables
>>> group.var<-names(data)[1]                # name of the grouping variable
>>> subgroups<-levels(data[[group.var]])   # names of subgroups; in real
>>> life - up to 30 subgroups
>>>
>>> # OBJECTIVE:
>>> # Need to create new variables based on the old ones (a & b)
>>> # For each new variable, the value in a given row is a function of (a)
>>> 2 constants (that have several levels each),
>>> # (b) value of the original variable (e.g., a.ind.to.max"), and the
>>> value in the previous row on the same new variable
>>> # Plus - it has to be done by subgroup (variable "group")
>>>
>>> # Defining 2 constants:
>>> constant1<-c(1:3)                # constant 1 used in transformation -
>>> has 3 levels, in real life - up to 7 levels
>>> constant2<-seq(.15,.45,.15)  # constant 2 used in transformation - has
>>> 3 levels, in real life - up to 7 levels
>>>
>>> ### CODE THAT IS SLOW. Reason - too many loops with the inner-most
>>> loop being very slow - as it is looping through rows:
>>>
>>> for(var in vars){                               # looping through variables
>>>   for(c1 in 1:length(constant1)){        # looping through values of constant1
>>>        for(c2 in 1:length(constant2)){   # looping through values of constant2
>>>          d=log(0.5)/constant1[c1]
>>>           l=-log(1-constant2[c2])
>>>           name<-paste(var,constant1[c1],constant2[c2]*100,".transf",sep=".")
>>>           data[[name]]<-NA
>>>           for(subgroup in subgroups){     # looping through subgroups
>>>             data[data[[group.var]] %in% subgroup, name][1] =
>>> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>>> var][1]*l*10)))
>>>
>>>       ### THIS SECTION IS THE SLOWEST - BECAUSE I AM LOOPING THROUGH ROWS:
>>>              for(case in 2:nrow(data[data[[group.var]] %in% subgroup,
>>> ])){ # looping through rows
>>>                data[data[[group.var]] %in% subgroup, name][case]=
>>> 1-((1-data[data[[group.var]] %in% subgroup,
>>> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>>> var][case]*l*10)))
>>>              }
>>>       ### END OF THE SLOWEST SECTION (INNERMOST LOOP)
>>
>> This is a good place to start! Look at the inner loop
>>
>> I notice there are a lot of expressions whose values don't change, but
>> are calculated repeatedly, e.g., exp(1) and data[[group.var]] %in%
>> subgroup. Let's move those common subexpressions out of the loop, e.g.,
>>
>> exp1 <- exp(1)
>> idx <- data[[group.var]] %in% subgroup
>>
>> for(case in 2:nrow(data[idx,])) {
>>    data[idx, name][case]=
>>        1-((1-data[idx, name][case-1]*exp1^d) /
>>           (exp1^(data[idx, var][case]*l*10)))
>> }
>>
>> wow that's already looking better! Looks like data[idx,var] could also
>> come out of the loop, and data[idx, name], too, so long as we assign
>> back in at the end
>>
>> exp1 <- exp(1)
>> idx <- data[[group.var]] %in% subgroup
>> a <- data[idx, var]
>> x <- data[idx, name]
>>
>> for(case in 2:length(x)) {
>>    x[case] <- 1 - (1 - x[case-1]*exp1^d) /
>>        (exp1^(a[case]*l*10))
>> }
>>
>> data[idx, name] <- x
>>
>> Still some redundant calculations, e.g., exp1^d, l*10. Some of the
>> things we've moved out of the innermost loop can be moved even further
>> out (e.g., exp1 all the way to the top). And a calculation like
>>
>>  exp1^(a[case] * l * 10)
>>
>> would be better written out of the loop as
>>
>>  b <- exp1^(a * l * 10)
>>
>> Why not try re-implementing your code with these ideas in mind? It'll be
>> important to make sure that the results of your original and new
>> implementation are the same (using all.equal, for instance). The 'CODE
>> THAT IS SLOW' part of your original code took 0.35s to execute on my
>> machine (using system.time), whereas my revised code with changes like
>> that above, and a couple of other small things, ran in 0.01s.
>>
>> What's your revised code look like now?
>>
>> Martin
>>
>>>           }
>>>       }
>>>    }
>>> }
>>>
>>> ### END OF THE CODE
>>>
>>> On Fri, Mar 26, 2010 at 5:25 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>>>> Dmitri:
>>>>
>>>> If you follow the R posting guide you're more likely to get useful replies.
>>>> In particular it asks for **small** reproducible examples -- your example is
>>>> far more code then I care to spend time on anyway (others may be more
>>>> willing or more able to do so of course). I suggest you try (if you haven't
>>>> already):
>>>>
>>>> 1. Profiling the code using Rprof to isolate where the time is spent.And
>>>> then...
>>>>
>>>> 2. Writing a **small** reproducible example to exercise that portion of the
>>>> code and post it with your question to the list. If you need to...
>>>> Typically, if you do these things you'll  figure out how to fix the
>>>> situation on your own.
>>>>
>>>> Cheers,
>>>>
>>>> Bert Gunter
>>>> Genentech Nonclinical Statistics
>>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>>>> Behalf Of Dimitri Liakhovitski
>>>> Sent: Friday, March 26, 2010 2:06 PM
>>>> To: r-help
>>>> Subject: [R] Competing with SPSS and SAS: improving code that loops
>>>> throughrows (data manipulation)
>>>>
>>>> Dear R-ers,
>>>>
>>>> In my question there are no statistics involved - it's all about data
>>>> manipulation in R.
>>>> I am trying to write a code that should replace what's currently being
>>>> done in SAS and SPSS. Or, at least, I am trying to show to my
>>>> colleagues R is not much worse than SAS/SPSS for the task at hand.
>>>> I've written a code that works but it's too slow. Probably because
>>>> it's looping through a lot of things. But I am not seeing how to
>>>> improve it. I've already written a different code but it's 5 times
>>>> slower than this one. The code below takes me slightly above 5 sec for
>>>> the tiny data set. I've tried using it with a real one - was not done
>>>> after hours.
>>>> Need help of the list! Maybe someone will have an idea on how to
>>>> increase the efficiency of my code (just one block of it - in the
>>>> "DATA TRANSFORMATION" Section below)?
>>>>
>>>> Below - I am creating the data set whose structure is similar to the
>>>> data sets the code should be applied to. Also - I have desribed what's
>>>> actually being done - in comments.
>>>> Thanks a lot to anyone for any suggestion!
>>>>
>>>> Dimitri
>>>>
>>>> ###### CREATING THE TEST DATA SET ################################
>>>>
>>>> set.seed(123)
>>>> data<-data.frame(group=c(rep("first",10),rep("second",10)),week=c(1:10,1:10)
>>>> ,a=abs(round(rnorm(20)*10,0)),
>>>> b=abs(round(rnorm(20)*100,0)))
>>>> data
>>>> dim(data)[1]  # !!! In real life I might have up to 150 (!) rows
>>>> (weeks) within each subgroup
>>>>
>>>> ### Specifying parameters used in the code below:
>>>> vars<-names(data)[3:4] # names of variables to be transformed
>>>> nr.vars<-length(vars) # number of variables to be transformed;  !!!
>>>> in real life I'll have to deal with up to 50-60 variables, not 2.
>>>> group.var<-names(data)[1] # name of the grouping variable
>>>> subgroups<-levels(data[[group.var]]) # names of subgroups;  !!! in
>>>> real life I'll have up to 20-25 subgroups, not 2.
>>>>
>>>> # For EACH subgroup: indexing variables a and b to their maximum in
>>>> that subgroup;
>>>> # Further, I'll have to use these indexed variables to build the new ones:
>>>> for(i in vars){
>>>>        new.name<-paste(i,".ind.to.max",sep="")
>>>>        data[[new.name]]<-NA
>>>> }
>>>>
>>>> indexed.vars<-names(data)[grep("ind.to.max$", names(data))] #
>>>> variables indexed to subgroup max
>>>> for(subgroup in subgroups){
>>>>        data[data[[group.var]] %in%
>>>> subgroup,indexed.vars]<-lapply(data[data[[group.var]] %in%
>>>> subgroup,vars],function(x){
>>>>                y<-x/max(x)
>>>>                return(y)
>>>>        })
>>>> }
>>>> data
>>>>
>>>> ############# DATA TRANSFORMATION #########################################
>>>>
>>>> # Objective: Create new variables based on the old ones (a and b ind.to.max)
>>>> # For each new variable, the value in a given row is a function of (a)
>>>> 2 constants (that have several levels each),
>>>> # (b) the corresponding value of the original variable (e.g.,
>>>> a.ind.to.max"), and the value in the previous row on the same new
>>>> variable
>>>> # PLUS: - it has to be done by subgroup (variable "group")
>>>>
>>>> constant1<-c(1:3)            # constant 1 used for transformation -
>>>> has 3 levels;  !!! in real life it will have up to 7 levels
>>>> constant2<-seq(.15,.45,.15)  # constant 2 used for transformation -
>>>> has 3 levels;  !!! in real life it will have up to 7 levels
>>>>
>>>> # CODE THAT IS TOO SLOW (it uses parameters specified in the previous
>>>> code section):
>>>> start1<-Sys.time()
>>>> for(var in indexed.vars){     # looping through variables
>>>>  for(c1 in 1:length(constant1)){     # looping through levels of constant1
>>>>          for(c2 in 1:length(constant2)){    # looping through levels of
>>>> constant2
>>>>      d=log(0.5)/constant1[c1]
>>>>      l=-log(1-constant2[c2])
>>>>
>>>> name<-paste(strsplit(var,".ind.to.max"),constant1[c1],constant2[c2]*100,"..t
>>>> ransf",sep=".")
>>>>      data[[name]]<-NA
>>>>      for(subgroup in subgroups){     # looping through subgroups
>>>>        data[data[[group.var]] %in% subgroup, name][1] =
>>>> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>>>> var][1]*l*10)))  # this is just the very first row of each subgroup
>>>>        for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){
>>>>  # looping through the remaining rows of the subgroup
>>>>          data[data[[group.var]] %in% subgroup, name][case]=
>>>> 1-((1-data[data[[group.var]] %in% subgroup,
>>>> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup,
>>>> var][case]*l*10)))
>>>>                        }
>>>>                }
>>>>        }
>>>>  }
>>>> }
>>>> end1<-Sys.time()
>>>> print(end1-start1) # Takes me ~0.53 secs
>>>> names(data)
>>>> data
>>>>
>>>> --
>>>> Dimitri Liakhovitski
>>>> Ninah.com
>>>> Dimitri.Liakhovitski at ninah.com
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>
>
>
> --
> Dimitri Liakhovitski
> Ninah.com
> Dimitri.Liakhovitski at ninah.com
>

-- 
Dimitri Liakhovitski
Ninah.com
Dimitri.Liakhovitski at ninah.com