[Bioc-devel] poor performance of snpsByOverlaps()
Vincent Carey
stvjc at channing.harvard.edu
Fri Jun 17 21:28:55 CEST 2016
I think you can get relevant information rapidly from the dbsnp vcf. You
would acquire
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz.tbi
and wrap in a TabixFile
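A minimal sketch of that wrapping step, using the Rsamtools TabixFile constructor (the file names assume both downloads landed in the working directory):

```r
library(Rsamtools)  # provides TabixFile

# Pair the bgzipped VCF with its .tbi index; if the index is omitted,
# TabixFile() defaults to paste0(file, ".tbi") anyway.
tf <- TabixFile("00-common_all.vcf.gz",
                index = "00-common_all.vcf.gz.tbi")
tf  # printing shows class, path, index, isOpen, yieldSize
```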
> tf
class: TabixFile
path: 00-common_all.vcf.gz
index: 00-common_all.vcf.gz.tbi
isOpen: FALSE
yieldSize: NA
> rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
+     IRanges(1,50000))), genome="hg19"))
then returns fairly quickly. Perhaps AnnotationHub can address this
issue. If you have the file locally:
> system.time(
+     rowRanges(readVcf(tf, param=ScanVcfParam(which=GRanges("10",
+         IRanges(1,50000))), genome="hg19")))
   user  system elapsed
  0.187   0.009   0.222
If instead you read from NCBI:
> tf2 = "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz"
> system.time(
+     rowRanges(readVcf(tf2, param=ScanVcfParam(which=GRanges("10",
+         IRanges(1,50000))), genome="hg19")))
   user  system elapsed
  0.237   0.055  16.476
Faster than a speeding SNPlocs? But perhaps there is information loss or
otherwise diminished functionality.
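On the functionality question, the rs IDs and positions at least are recoverable from the VCF query. A hedged sketch of reshaping the result into something resembling snpsByOverlaps() output, assuming the dbSNP VCF carries rs IDs as the VCF ID column (which readVcf() exposes as the row names) and that `tf` is the TabixFile from above:

```r
library(VariantAnnotation)

# Query a small window of the local tabix-indexed dbSNP VCF.
param <- ScanVcfParam(which = GRanges("10", IRanges(1, 50000)))
rr <- rowRanges(readVcf(tf, genome = "hg19", param = param))

# rs IDs come back as the names of the GRanges, so a rough
# snpsByOverlaps()-like result is:
snps <- granges(rr)
mcols(snps)$RefSNP_id <- names(rr)
snps
```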
On Fri, Jun 17, 2016 at 12:53 PM, Robert Castelo <robert.castelo at upf.edu>
wrote:
> Hi,
>
> the performance of snpsByOverlaps() in terms of time and memory
> consumption is quite poor, and I wonder whether there is a bug in the
> code. Here's one example:
>
> library(GenomicRanges)
> library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
>
> snps <- SNPlocs.Hsapiens.dbSNP144.GRCh37
>
> gr <- GRanges(seqnames="ch10", IRanges(123276830, 123276830))
>
> system.time(ov <- snpsByOverlaps(snps, gr))
>    user  system elapsed
>  33.768   0.124  33.955
>
> system.time(ov <- snpsByOverlaps(snps, gr))
>    user  system elapsed
>  33.150   0.281  33.494
>
>
> I've shown the call to snpsByOverlaps() twice to account for the
> possibility that the first call was caching data and the second could be
> much faster, but that is not the case.
>
> If I do the same with a larger GRanges object, for instance the one
> attached to this email, memory consumption grows to about 20 GB. To me
> this, in conjunction with the previous observation, suggests something
> wrong with the caching of the data.
>
>
>
> I look forward to your comments and possible solutions,
>
>
> thanks!!!
>
>
> robert.
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>