[Bioc-devel] poor performance of snpsByOverlaps()

Vincent Carey stvjc at channing.harvard.edu
Fri Jun 17 21:28:55 CEST 2016


I think you can get relevant information rapidly from the dbSNP VCF. You
would acquire

ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz

ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz.tbi

and wrap in a TabixFile.
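
A minimal sketch of that wrapping step (assuming both files have been
downloaded to the working directory; VariantAnnotation is loaded for the
readVcf() calls below):

library(Rsamtools)
library(VariantAnnotation)
## wrap the compressed VCF together with its tabix index
tf <- TabixFile("00-common_all.vcf.gz",
                index="00-common_all.vcf.gz.tbi")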

> tf
class: TabixFile
path: 00-common_all.vcf.gz
index: 00-common_all.vcf.gz.tbi
isOpen: FALSE
yieldSize: NA

> rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
+   IRanges(1,50000))), genome="hg19"))

then returns fairly quickly. Perhaps AnnotationHub can address this issue
(see the sketch after the timing below). If you have the file locally,

> system.time(
+   rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
+     IRanges(1,50000))), genome="hg19")))
   user  system elapsed
  0.187   0.009   0.222
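
Regarding AnnotationHub, a sketch of how one might look for a hosted copy
(the query terms here are guesses rather than confirmed resource names;
the hub's actual holdings would need checking):

library(AnnotationHub)
ah <- AnnotationHub()
## search the hub metadata for dbSNP VCF resources for human
query(ah, c("dbSNP", "VCF", "Homo sapiens"))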


If instead you read from NCBI:

> tf2 = "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz"
> system.time(
+   rowRanges(readVcf(tf2, param=ScanVcfParam(which=GRanges("10",
+     IRanges(1,50000))), genome="hg19")))
   user  system elapsed
  0.237   0.055  16.476


Faster than a speeding snplocs? But perhaps there is information loss or
other diminished functionality.
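
On the information-loss point, the rs identifiers at least should survive:
readVcf() carries the VCF ID column through as the names of the returned
ranges, so something like the following (reusing the tf object from above)
recovers them:

rr <- rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
    IRanges(1,50000))), genome="hg19"))
head(names(rr))  # rs identifiers taken from the VCF ID column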

On Fri, Jun 17, 2016 at 12:53 PM, Robert Castelo <robert.castelo at upf.edu>
wrote:

> Hi,
>
> the performance of snpsByOverlaps(), in terms of time and memory
> consumption, is quite poor, and I wonder whether there is a bug in the
> code. Here's one example:
>
> library(GenomicRanges)
> library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
>
> snps <- SNPlocs.Hsapiens.dbSNP144.GRCh37
>
> gr <- GRanges(seqnames="ch10", IRanges(123276830, 123276830))
>
> system.time(ov <- snpsByOverlaps(snps, gr))
>    user  system elapsed
>  33.768   0.124  33.955
>
> system.time(ov <- snpsByOverlaps(snps, gr))
>    user  system elapsed
>  33.150   0.281  33.494
>
>
> I've shown the call to snpsByOverlaps() twice to account for the
> possibility that the first call was caching data and the second would be
> much faster, but that is not the case.
>
> If I do the same with a larger GRanges object, for instance the one
> attached to this email, then memory consumption grows to about 20
> gigabytes. To me this, in conjunction with the previous observation,
> suggests something is wrong with how the data are cached.
>
> I look forward to your comments and possible solutions.
>
> Thanks!!!
>
> Robert.
