[R] Programming R to avoid loops

Brant Inman brant.inman at me.com
Sat Apr 18 06:14:46 CEST 2015


I have two large data frames with the following structure:

> df1
  id       date test1.result
1  a 2009-08-28            1
2  a 2009-09-16            1
3  b 2008-08-06            0
4  c 2012-02-02            1
5  c 2010-08-03            1
6  c 2012-08-02            0

> df2
  id       date test2.result
1  a 2011-02-03            1
2  b 2011-09-27            0
3  b 2011-09-01            1
4  c 2009-07-16            0
5  c 2009-04-15            0
6  c 2010-08-10            1
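
For anyone who wants to reproduce this, the two example data frames can be rebuilt roughly as follows. This is a sketch that assumes the date column is of class Date and the result columns are numeric 0/1; the real data may be stored differently.

```r
# Rebuild the two example data frames shown above.
# Assumption: dates are stored as Date objects, results as numeric 0/1.
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)
```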

I need to match items in df2 to those in df1 using specific criteria: for each row of df1, I want the rows of df2 with the same id whose date falls within a window around the df1 date. I have written a looped matching algorithm that works, but it is very slow with my large datasets. I am requesting help on making a version of this code that is faster and "vectorized", so to speak.

My algorithm currently looks something like the code below. It works but is damn slow.

findTestPairs <- function(test1, id1, date1, test2, id2, date2, predays=-30, 
                          lagdays=30){
  # Function to find, within subjects, two tests that occur within a timeframe
  #
  # test1 = the reference test result for which matching second tests are sought
  # test2 = the second test result
  # date1 = the date of test1
  # date2 = the date of test2
  # id1   = unique identifier for subject undergoing test 1
  # id2   = unique identifier for subject undergoing test 2
  # predays  = maximum number of days prior to test1 date that test2 date might occur
  #            (given as a negative number, e.g. -30)
  # lagdays  = maximum number of days after test1 date that test2 date might occur
    
  result <- data.frame(matrix(ncol=5, nrow=length(test1)))
  colnames(result) <- c('id','test1','date','test2count','test2lag.result')
  result$id    <- id1
  result$test1 <- test1
  result$date  <- date1
    
  for(i in seq_along(test1)){
    l <- 0    # Counter of test2 results that matches test1 within lag interval
    m <- NA   # Indicator of positive test2 within lag interval
        
    for(j in seq_along(test2)){
      if(id1[i] == id2[j]){               # STEP1: Match IDs
        interval <- date2[j] - date1[i]
        intmatch <- interval >= predays && interval <= lagdays

        if(intmatch){                     # STEP2: Does test2 fall within lag interval?
          l <- l+1                        # If test2 within lag interval, count it

          if(test2[j] == 1) {             # STEP3: Is test 2 positive?
            m <- 1                        # If test2 is positive, set indicator to 1
          } else if(is.na(m)) {
            m <- 0                        # Negative so far; keep any earlier positive
          }
        }
      }
    }  
    result$test2count[i] <- l
    result$test2lag.result[i] <- m
  }  
  return(result)
}  

I would appreciate help on building a faster matching algorithm. I am pretty certain that R functions can be used to do this but I do not have a good grasp of how to make it work.
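
To make the request concrete, here is a rough sketch of the kind of loop-free version I have in mind. It is not a tested solution. It assumes the dates are of class Date with no missing values, and it assumes the indicator should be 1 if any test2 inside the window is positive. The name findTestPairsMerge and the helper column names are made up for illustration. The idea is to form all within-subject pairs with merge(), filter them by the date window, and then summarise per test1 row.

```r
# Hypothetical vectorized variant; the function name is illustrative only.
findTestPairsMerge <- function(test1, id1, date1, test2, id2, date2,
                               predays = -30, lagdays = 30) {
  d1 <- data.frame(row = seq_along(test1), id = id1, date1 = date1,
                   stringsAsFactors = FALSE)
  d2 <- data.frame(id = id2, date2 = date2, test2 = test2,
                   stringsAsFactors = FALSE)

  # All within-subject pairs of (test1 row, test2 row).
  pairs <- merge(d1, d2, by = "id")

  # Keep pairs whose test2 date falls inside the window around test1.
  # Assumption: Date columns, so the difference is measured in days.
  lag   <- as.numeric(pairs$date2 - pairs$date1)
  pairs <- pairs[which(lag >= predays & lag <= lagdays), , drop = FALSE]

  # Per test1 row: number of matching test2 results.
  count <- tabulate(pairs$row, nbins = length(test1))

  # Per test1 row: 1 if any matching test2 is positive, 0 if none is,
  # NA if there was no matching test2 at all.
  anypos <- rep(NA_real_, length(test1))
  if (nrow(pairs) > 0) {
    pos <- tapply(pairs$test2 == 1, pairs$row, any)
    anypos[as.integer(names(pos))] <- as.numeric(pos)
  }

  data.frame(id = id1, test1 = test1, date = date1,
             test2count = count, test2lag.result = anypos,
             stringsAsFactors = FALSE)
}

# Example data, rebuilt from the printout above (Date class assumed).
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

res <- findTestPairsMerge(df1$test1.result, df1$id, df1$date,
                          df2$test2.result, df2$id, df2$date)
```

On the example data only row 5 of df1 (id c, 2010-08-03) has a test2 inside the window (2010-08-10, positive), so test2count is 1 there and 0 elsewhere. One caveat: merge() materialises every within-id pair before filtering, so this could use a lot of memory if some subjects have many tests.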

Brant Inman


