[R] Programming R to avoid loops
Brant Inman
brant.inman at me.com
Sat Apr 18 06:14:46 CEST 2015
I have two large data frames with the following structure:
> df1
  id       date test1.result
1  a 2009-08-28            1
2  a 2009-09-16            1
3  b 2008-08-06            0
4  c 2012-02-02            1
5  c 2010-08-03            1
6  c 2012-08-02            0
> df2
  id       date test2.result
1  a 2011-02-03            1
2  b 2011-09-27            0
3  b 2011-09-01            1
4  c 2009-07-16            0
5  c 2009-04-15            0
6  c 2010-08-10            1
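For anyone wanting to reproduce the example, the two data frames above can be rebuilt along these lines. This is only a sketch: the column classes are not shown in the printout, so it assumes that id is a character vector and that date is stored as class Date.

```r
# Rebuild the example data shown above.
# Assumption (not visible in the printout): id is character, date is class Date.
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)
```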
I need to match items in df2 to those in df1 using specific criteria: a test in df2 matches a test in df1 when both have the same id and the df2 date falls within a given window around the df1 date (by default, 30 days either side). I have written a looped matching algorithm that works, but it is very slow on my large datasets. I am requesting help in writing a faster, "vectorized" version of this code.
My current algorithm is essentially the following. It works, but it is painfully slow.
findTestPairs <- function(test1, id1, date1, test2, id2, date2, predays=-30,
                          lagdays=30){
  # Function to find, within subjects, pairs of tests that occur within a given timeframe
  #
  # test1 = the reference test result for which matching second tests are sought
  # test2 = the second test result
  # date1 = the date of test1
  # date2 = the date of test2
  # id1 = unique identifier for subject undergoing test 1
  # id2 = unique identifier for subject undergoing test 2
  # predays = earliest allowed test2 date relative to test1 date, in days (negative = before test1)
  # lagdays = maximum number of days after test1 date that test2 date might occur
  result <- data.frame(matrix(ncol=5, nrow=length(test1)))
  colnames(result) <- c('id','test1','date','test2count','test2lag.result')
  result$id <- id1
  result$test1 <- test1
  result$date <- date1
  for(i in seq_along(test1)){
    l <- 0   # Counter of test2 results that match test1 within lag interval
    m <- NA  # Result (1/0) of the last test2 found within lag interval; NA if none
    for(j in seq_along(test2)){
      if(id1[i] == id2[j]){  # STEP1: Match IDs
        interval <- as.numeric(date2[j] - date1[i])  # Days from test1 to test2
        if(interval >= predays && interval <= lagdays){  # STEP2: Does test2 fall within lag interval?
          l <- l+1  # If test2 within lag interval, count it
          if(test2[j] == 1) {  # STEP3: Is test 2 positive?
            m <- 1  # If test2 is positive, set indicator to 1
          } else {
            m <- 0
          }
        }
      }
    }
    result$test2count[i] <- l
    result$test2lag.result[i] <- m
  }
  return(result)
}
I would appreciate help with building a faster matching algorithm. I am fairly certain that existing R functions can do this, but I do not have a good grasp of how to make them work here.
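One possible direction, offered as a rough sketch and not a tested solution for large data: join the two data frames on id with merge(), compute the day difference for every within-subject pair at once, and then summarise per test1 row. The function name findTestPairsVec and the helper column row are hypothetical, the column names are taken from the example printout, and the date columns are assumed to be of class Date. Here test2lag.result is taken to be 1 if any test2 in the window is positive, 0 if all are negative, and NA if there are none; that is an assumption about the intent, whereas the loop version keeps the result of the last match it encounters.

```r
# Sketch of a merge-based alternative to the double loop.
# findTestPairsVec and the 'row' helper column are hypothetical names.
findTestPairsVec <- function(df1, df2, predays = -30, lagdays = 30) {
  df1$row <- seq_len(nrow(df1))  # remember the original test1 row
  # All within-subject (test1, test2) pairs; shared 'date' columns get suffixes
  pairs <- merge(df1, df2, by = "id", suffixes = c(".1", ".2"))
  lag <- as.numeric(pairs$date.2 - pairs$date.1)  # days from test1 to test2
  inwin <- pairs[lag >= predays & lag <= lagdays, ]
  # Number of test2 results in the window for each test1 row
  count <- tabulate(inwin$row, nbins = nrow(df1))
  # 1 if any test2 in the window is positive, 0 if none is, NA if no test2 at all
  anypos <- rep(NA_real_, nrow(df1))
  pos <- tapply(inwin$test2.result, inwin$row, max)
  anypos[as.integer(names(pos))] <- pos
  data.frame(id = df1$id, test1 = df1$test1.result, date = df1$date,
             test2count = count, test2lag.result = anypos)
}

# Small check using only the rows for subject "c" from the example data
sub1 <- data.frame(id = "c",
                   date = as.Date(c("2012-02-02", "2010-08-03", "2012-08-02")),
                   test1.result = c(1, 1, 0), stringsAsFactors = FALSE)
sub2 <- data.frame(id = "c",
                   date = as.Date(c("2009-07-16", "2009-04-15", "2010-08-10")),
                   test2.result = c(0, 0, 1), stringsAsFactors = FALSE)
res <- findTestPairsVec(sub1, sub2)
print(res)
```

Note that merge() materialises every within-subject pair before filtering, so memory use grows with the product of the number of test1 and test2 rows per subject; for subjects with very many tests this could become the bottleneck.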
Brant Inman