[Bioc-devel] poor performance of snpsByOverlaps()
Robert Castelo
robert.castelo at upf.edu
Tue Jun 21 19:21:56 CEST 2016
Vince,

thanks a lot for the example of streaming dbSNP over the internet, and
for showing that this is even faster than accessing the data locally. To
me, this just confirms that the current performance of the
SNPlocs.Hsapiens.dbSNP144.GRCh37 annotation package can be improved.
Hervé will look at it and will hopefully find a fix, if there is a bug,
or a way to speed it up.

cheers,
robert.
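[For readers following along, the streaming recipe from the quoted
message below can be sketched end to end as follows. This is a minimal
sketch, assuming the Bioconductor packages Rsamtools, VariantAnnotation
and GenomicRanges are installed, and that 00-common_all.vcf.gz and its
.tbi index have already been downloaded from the NCBI FTP URLs quoted
below.]

```r
## Sketch of the Tabix-based streaming approach described below.
## Assumes 00-common_all.vcf.gz and 00-common_all.vcf.gz.tbi are in
## the working directory (downloaded from the NCBI FTP site).
library(Rsamtools)          # TabixFile()
library(VariantAnnotation)  # readVcf(), ScanVcfParam()
library(GenomicRanges)      # GRanges(), IRanges()

## Wrap the compressed VCF and its Tabix index.
tf <- TabixFile("00-common_all.vcf.gz",
                index="00-common_all.vcf.gz.tbi")

## Restrict the read to the region of interest so that only the
## overlapping records are parsed, instead of the whole file.
param <- ScanVcfParam(which=GRanges("10", IRanges(1, 50000)))
snps <- rowRanges(readVcf(tf, genome="hg19", param=param))
```

The key point is that ScanVcfParam() confines the work to the queried
region, which is why the local query below completes in a fraction of a
second.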
On 06/17/2016 09:28 PM, Vincent Carey wrote:
> I think you can get relevant information rapidly from the dbsnp vcf.
> You would acquire
>
> ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz
>
> ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz.tbi
>
> and wrap in a TabixFile
>
>> tf
> class: TabixFile
> path: 00-common_all.vcf.gz
> index: 00-common_all.vcf.gz.tbi
> isOpen: FALSE
> yieldSize: NA
>
>
> rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
> IRanges(1,50000))), genome="hg19"))
>
> then returns fairly quickly. Perhaps AnnotationHub can address this
> issue. If you have the file locally,
>
>> system.time(
> + rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
> + IRanges(1,50000))), genome="hg19")))
>    user  system elapsed
>   0.187   0.009   0.222
>
>
> If instead you read from NCBI
>
>> tf2 =
> "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz"
>
>> system.time(
> + rowRanges(readVcf(tf2, param=ScanVcfParam(which=GRanges("10",
> + IRanges(1,50000))), genome="hg19")))
>    user  system elapsed
>   0.237   0.055  16.476
>
>
> Faster than a speeding snplocs? But perhaps there is information loss
> or other diminished functionality.
>
>
> On Fri, Jun 17, 2016 at 12:53 PM, Robert Castelo <robert.castelo at upf.edu
> <mailto:robert.castelo at upf.edu>> wrote:
>
> hi,
>
> the performance of snpsByOverlaps() in terms of time and memory
> consumption is quite poor, and I wonder whether there is some bug in
> the code. Here's one example:
>
> library(GenomicRanges)
> library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
>
> snps <- SNPlocs.Hsapiens.dbSNP144.GRCh37
>
> gr <- GRanges(seqnames="ch10", IRanges(123276830, 123276830))
>
> system.time(ov <- snpsByOverlaps(snps, gr))
>    user  system elapsed
>  33.768   0.124  33.955
>
> system.time(ov <- snpsByOverlaps(snps, gr))
>    user  system elapsed
>  33.150   0.281  33.494
>
>
> I've shown the call to snpsByOverlaps() twice to account for the
> possibility that the first call was caching data and the second would
> be much faster, but that is not the case.
>
> If I do the same with a larger GRanges object, for instance the one
> attached to this email, then memory consumption grows to about
> 20 Gbytes. To me, this, in conjunction with the previous observation,
> suggests something is wrong with the caching of the data.
>
>
>
> I look forward to your comments and possible solutions,
>
>
> thanks!!!
>
>
> robert.
> _______________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550