[R] Programming R to avoid loops
Charles C. Berry
ccberry at ucsd.edu
Sat Apr 18 19:48:17 CEST 2015
On Sat, 18 Apr 2015, Brant Inman wrote:
> I have two large data frames with the following structure:
>
>> df1
>   id       date test1.result
> 1  a 2009-08-28            1
> 2  a 2009-09-16            1
> 3  b 2008-08-06            0
> 4  c 2012-02-02            1
> 5  c 2010-08-03            1
> 6  c 2012-08-02            0
>
>> df2
>   id       date test2.result
> 1  a 2011-02-03            1
> 2  b 2011-09-27            0
> 3  b 2011-09-01            1
> 4  c 2009-07-16            0
> 5  c 2009-04-15            0
> 6  c 2010-08-10            1
>
> I need to match items in df2 to those in df1 with specific matching
> criteria. I have written a looped matching algorithm that works, but it
> is very slow with my large datasets. I am requesting help on making a
> version of this code that is faster and "vectorized" so to speak.
As I read your posted code, you match ids exactly, match dates within a
range, and count the number of positive test results in the second
data.frame.
For this, the countOverlaps() function of the GenomicRanges package will
do the trick with suitably defined GRanges objects. Something like:
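    library(GenomicRanges)

    # Dates as integer day counts, so they can serve as range coordinates
    date1 <- as.integer(as.Date(df1$date, "%Y-%m-%d"))
    date2 <- as.integer(as.Date(df2$date, "%Y-%m-%d"))

    lagdays <- 30L
    predays <- -30L

    # One width-1 range per df1 row; the id plays the role of the sequence name
    gr1 <- GRanges(seqnames = df1$id, IRanges(start = date1, width = 1), strand = "*")

    # One date window per df2 row, keeping only the positive test2 results
    gr2 <- GRanges(seqnames = df2$id,
                   IRanges(start = date2 + predays, end = date2 + lagdays),
                   strand = "*")[df2$test2.result == 1]

    df1$test2.count <- countOverlaps(gr1, gr2)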
For the example data.frames (as rendered by Jim Lemon's code), this yields
> df1
  id       date test1.result test2.count
1  a 2009-08-28            1           0
2  a 2009-09-16            1           0
3  b 2008-08-06            0           0
4  c 2012-02-02            1           0
5  c 2010-08-03            1           1
6  c 2012-08-02            0           0
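
As a sanity check on small data, the same rule (same id, date within 30 days either way, positive test2 result) can be spelled out in base R. This is only a sketch for verifying the counts, not a fast replacement for countOverlaps(): it still visits each row of df1 in turn. The data frames are retyped from the example in the question, and the names window_days, pos, d1, dpos and check are made up here for illustration.

```r
# Example data retyped from the question.
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Assumption: a symmetric window, mirroring predays = -30 and lagdays = 30.
window_days <- 30L

# Keep only the positive test2 results.
pos <- df2[df2$test2.result == 1, ]

# Work with integer day counts, as in the GenomicRanges version.
d1   <- as.integer(df1$date)
dpos <- as.integer(pos$date)

# For each df1 row, count positive df2 rows with the same id whose date
# lies within window_days of the df1 date.
check <- vapply(seq_len(nrow(df1)), function(i) {
  sum(pos$id == df1$id[i] & abs(dpos - d1[i]) <= window_days)
}, integer(1))

print(check)
```

For the example data, check should agree with the test2.count column shown above.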
The GenomicRanges package is at
http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html
where you will find installation instructions and links to vignettes.
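
For convenience, a minimal installation sketch follows. It assumes a current R release, where the BiocManager package from CRAN is the usual way to install Bioconductor packages; the page linked above remains the authoritative source if this differs for your R version.

```r
# Install the Bioconductor installer from CRAN if it is not already present.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install GenomicRanges (and its dependencies) from Bioconductor.
BiocManager::install("GenomicRanges")
```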
HTH,
Chuck