[Bioc-devel] poor performance of snpsByOverlaps()
Robert Castelo
robert.castelo at upf.edu
Tue Jun 21 19:21:56 CEST 2016
Vince,

thanks a lot for the example of streaming dbSNP over the internet, and
for showing that this is even faster than accessing the data locally. To
me, this just confirms that the current performance of the
SNPlocs.Hsapiens.dbSNP144.GRCh37 annotation package can be improved.
Hervé will look at it and will hopefully find a fix, if there is a bug,
or a way to speed it up.

cheers,
robert.
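[For readers following along, the streaming recipe from the quoted
message below can be sketched end to end as follows. This is a minimal
sketch, assuming the Bioconductor packages Rsamtools, VariantAnnotation
and GenomicRanges are installed, and that 00-common_all.vcf.gz and its
.tbi index have already been downloaded from the NCBI FTP URLs quoted
below.]

```r
## Sketch of the Tabix-based streaming approach described below.
## Assumes 00-common_all.vcf.gz and 00-common_all.vcf.gz.tbi are in
## the working directory (downloaded from the NCBI FTP site).
library(Rsamtools)          # TabixFile()
library(VariantAnnotation)  # readVcf(), ScanVcfParam()
library(GenomicRanges)      # GRanges(), IRanges()

## Wrap the compressed VCF and its Tabix index.
tf <- TabixFile("00-common_all.vcf.gz",
                index="00-common_all.vcf.gz.tbi")

## Restrict the read to the region of interest so that only the
## overlapping records are parsed, instead of the whole file.
param <- ScanVcfParam(which=GRanges("10", IRanges(1, 50000)))
snps <- rowRanges(readVcf(tf, genome="hg19", param=param))
```

The key point is that ScanVcfParam() confines the work to the queried
region, which is why the local query below completes in a fraction of a
second.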
On 06/17/2016 09:28 PM, Vincent Carey wrote:
> I think you can get relevant information rapidly from the dbsnp vcf.
> You would acquire
>
> ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz
>
> ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz.tbi
>
> and wrap in a TabixFile
>
>> tf
> class: TabixFile
> path: 00-common_all.vcf.gz
> index: 00-common_all.vcf.gz.tbi
> isOpen: FALSE
> yieldSize: NA
>
>
> rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
> IRanges(1,50000))), genome="hg19"))
>
> then returns fairly quickly. Perhaps AnnotationHub can address this
> issue. If you have the file locally,
>
>> system.time(
> + rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
> + IRanges(1,50000))), genome="hg19")))
>    user  system elapsed
>   0.187   0.009   0.222
>
>
> If instead you read from NCBI
>
>> tf2 =
> "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz"
>
>> system.time(
> + rowRanges(readVcf(tf2, param=ScanVcfParam(which=GRanges("10",
> + IRanges(1,50000))), genome="hg19")))
>    user  system elapsed
>   0.237   0.055  16.476
>
>
> Faster than a speeding snplocs? But perhaps there is information loss
> or other diminished functionality.
>
>
> On Fri, Jun 17, 2016 at 12:53 PM, Robert Castelo <robert.castelo at upf.edu
> <mailto:robert.castelo at upf.edu>> wrote:
>
> hi,
>
> the performance of snpsByOverlaps() in terms of time and memory
> consumption is quite poor, and I wonder whether there is some bug in
> the code. Here's one example:
>
> library(GenomicRanges)
> library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
>
> snps <- SNPlocs.Hsapiens.dbSNP144.GRCh37
>
> gr <- GRanges(seqnames="ch10", IRanges(123276830, 123276830))
>
> system.time(ov <- snpsByOverlaps(snps, gr))
>    user  system elapsed
>  33.768   0.124  33.955
>
> system.time(ov <- snpsByOverlaps(snps, gr))
>    user  system elapsed
>  33.150   0.281  33.494
>
>
> I've shown the call to snpsByOverlaps() twice to account for the
> possibility that the first call was caching data and the second would
> be much faster, but that is not the case.
>
> If I do the same with a larger GRanges object, for instance the one
> attached to this email, then memory consumption grows to about
> 20 Gbytes. To me, this, in conjunction with the previous observation,
> suggests something is wrong with the caching of the data.
>
>
>
> I look forward to your comments and possible solutions,
>
>
> thanks!!!
>
>
> robert.
> _______________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550