[R] Needing a better solution to a lookup problem.
ilai
keren at math.montana.edu
Wed Mar 14 21:26:25 CET 2012
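You could try doing it without a loop (in .C or otherwise): merge the two
data frames on their shared CHR column, then keep the rows where POS falls
inside the region:
(rgnsnp <- merge(region,snps))  # every region/snp pair within each chromosome
(rgnsnp[with(rgnsnp,STOP>=POS & POS >= START),])  # pairs with START <= POS <= STOP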
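For the example data in the quoted message below, the merge-and-filter idea can be taken one step further to reproduce the requested Res table exactly. This is only a sketch: the data frames are rebuilt here with data.frame() so the block runs on its own, and the names hits and res are made up for illustration. unique() is there on the assumption that regions may overlap, in which case one snp would otherwise appear once per matching region.

```r
# Example data from the original question, rebuilt so the block is self-contained
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR   = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP  = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# merge() joins on the only shared column, CHR
rgnsnp <- merge(region, snps)

# keep region/snp pairs where the position lies inside the region
hits <- rgnsnp[rgnsnp$POS >= rgnsnp$START & rgnsnp$POS <= rgnsnp$STOP, ]

# drop the region columns and any duplicates (hypothetical names: hits, res)
res <- unique(hits[, c("CHR", "POS", "DAT")])
rownames(res) <- NULL
res
```

Chromosome 3 has no regions, so its snp drops out at the merge step already.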
Here is my timing test for merge + filter on 100k/200k rows:
fdf1 <- data.frame(chr=1:100000,p=runif(100000),d=sample(100000))
fdf2 <- data.frame(chr=rep(1:100000,2),s=runif(200000),t=runif(200000))
system.time(with(FDF <- merge(fdf2,fdf1),FDF[s>=p & p >= t,]))
   user  system elapsed
  2.560   0.152   2.905
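A caveat on that timing: in the test every chr value has one row in fdf1 and two in fdf2, so the merged frame has only 200,000 rows. With a few dozen real chromosomes, merge() builds every region/snp pair within each chromosome, which can get very large at 100k snps by 200k regions. A possible alternative, sketched below, counts for each position how many regions have started and how many have already ended using findInterval(), one chromosome at a time. It assumes integer positions and START <= STOP in every region; the function name in_any_region is hypothetical.

```r
# Example data from the original question, rebuilt so the block is self-contained
snps <- data.frame(CHR = c("1", "2", "2", "3", "X"),
                   POS = c(295L, 640L, 670L, 100L, 1100L),
                   DAT = 1:5,
                   stringsAsFactors = FALSE)
region <- data.frame(CHR   = c("1", "1", "2", "2", "2", "X"),
                     START = c(10L, 210L, 430L, 650L, 810L, 1090L),
                     STOP  = c(100L, 350L, 630L, 675L, 850L, 1111L),
                     stringsAsFactors = FALSE)

# Hypothetical helper: keep the snps that fall inside at least one region.
# Assumes integer POS/START/STOP and START <= STOP for every region.
in_any_region <- function(snps, region) {
  keep <- logical(nrow(snps))                       # all FALSE to begin with
  reg_by_chr <- split(region, as.character(region$CHR))
  for (chr in unique(as.character(snps$CHR))) {
    reg <- reg_by_chr[[chr]]
    if (is.null(reg)) next                          # no regions on this chromosome
    idx <- which(snps$CHR == chr)
    pos <- snps$POS[idx]
    n_started <- findInterval(pos, sort(reg$START))      # regions with START <= pos
    n_ended   <- findInterval(pos - 1L, sort(reg$STOP))  # regions with STOP  <  pos
    keep[idx] <- n_started > n_ended                # some region is still open at pos
  }
  snps[keep, ]
}

kept <- in_any_region(snps, region)
kept
```

Because a region that has ended must also have started, the difference between the two counts is the number of regions covering the position, so overlapping regions are handled without building any pairs.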
Hope this helps
Elai
On Wed, Mar 14, 2012 at 1:27 PM, Davis, Brian <Brian.Davis at uth.tmc.edu> wrote:
> I have a solution (actually a few) to this problem, but none is computationally efficient enough to be useful. I'm hoping someone can point me to a better solution.
>
> I have a data frame of chromosome/position pairs (along with other data for the location). For each pair I need to determine if it is within a given data frame of ranges. I need to keep only the pairs that are within any of the ranges for further processing.
>
> Example:
> snps<-NULL
> snps$CHR<-c("1","2","2","3","X")
> snps$POS<-as.integer(c(295,640,670,100,1100))
> snps$DAT<-seq_along(snps$CHR)
> snps<-as.data.frame(snps, stringsAsFactors=FALSE)
>
> snps
>   CHR  POS DAT
> 1   1  295   1
> 2   2  640   2
> 3   2  670   3
> 4   3  100   4
> 5   X 1100   5
>
> region<-NULL
> region$CHR<-c("1","1","2","2","2","X")
> region$START<-as.integer(c(10,210,430,650,810,1090))
> region$STOP<-as.integer(c(100,350,630,675,850,1111))
> region<-as.data.frame(region, stringsAsFactors=FALSE)
>
> region
>   CHR START STOP
> 1   1    10  100
> 2   1   210  350
> 3   2   430  630
> 4   2   650  675
> 5   2   810  850
> 6   X  1090 1111
>
>
> The result I need would look like
>
> Res
>
> CHR  POS DAT
>   1  295   1
>   2  670   3
>   X 1100   5
>
>
> I have a solution that works reasonably well on small sets, but my current data set has ~100K snp entries, and my regions table has ~200K entries. I have ~1500 files to go through.
>
> I haven't found a good way to efficiently solve this problem. I've tried various versions of mapply/lapply, for loops, etc., which get the answer for small sets but take hours (per file) on my real data. Bioconductor seemed like the obvious place to look, but my GoogleFu must not be that great. I never found anything relevant.
>
> Any ideas or pointers in the right direction would be greatly appreciated.
>
>
>
> Brian Davis
>