[Bioc-devel] Slow performance on scanBam
Martin Morgan
mtmorgan at fhcrc.org
Tue May 13 14:16:39 CEST 2014
Hi James -- I don't think there's anything in existence to make this easier, but
I'll expose something in the next 24 hours; is your data shareable? There might
be deeper things to be done for processing this small-but-numerous style data.
Martin
On 05/12/2014 05:32 PM, James Bullard wrote:
> I've been dealing with some small bam files with millions of reference
> sequences leading to monster headers. As one might guess, this leads to
> pretty bad performance when calling scanBam.
>
> Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 million
> reference sequences in the reference fasta file):
>
>> scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
> user system elapsed
> 186.264 0.528 186.934
>
> I've traced it down to scanBamHeader and seqinfo-BamFile, the problematic
> code is in scanBamHeader which processes the entire header when all seqinfo
> needs is the `targets` portion of the list. Additionally, the
> order(rankSeqLevels(.)) doesn't scale either. So I've replaced this as
> well. I've changed the body of seqinfo-BamFile from:
>
> h <- scanBamHeader(x)[["targets"]]
> o <- order(rankSeqlevels(names(h)))
> Seqinfo(names(h)[o], unname(h)[o])
>
> to this:
>
> if (!isOpen(x)) {
> open(x)
> on.exit(close(x))
> }
> h <- .Call(.read_bamfile_header, .extptr(x))$targets
> Seqinfo(names(h), unname(h))
>
> We then get:
>
>> scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
> user system elapsed
> 14.780 0.360 15.158
>
> Which is still pretty slow for how small these files are, but I can
> probably live with that.
>
> Two questions:
> -- do we need the: order(rankSeqlevels(names(h))) bit? that does change the
> return value, but I can certainly live with the ordering in the file.
> -- what else am I missing?
>
> I can send a patch if need be, but would like to hear from the cognoscenti
> first if there is a "built-in" way to avoid this overhead.
>
> thanks, jim
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list