[Bioc-devel] Slow performance on scanBam

Martin Morgan mtmorgan at fhcrc.org
Tue May 13 14:16:39 CEST 2014


Hi James -- I don't think there's anything in existence to make this easier, but 
I'll expose something in the next 24 hours; is your data shareable? There might 
be deeper things to be done for processing this small-but-numerous style data.

Martin

On 05/12/2014 05:32 PM, James Bullard wrote:
> I've been dealing with some small bam files with millions of reference
> sequences leading to monster headers. As one might guess, this leads to
> pretty bad performance when calling scanBam.
>
> Right now, it takes a bit (27MB bam file, 16k alignments, 2.5 million
> reference sequences in the reference fasta file):
>
>> scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
>     user  system elapsed
> 186.264   0.528 186.934
>
> I've traced it down to scanBamHeader and seqinfo-BamFile, the problematic
> code is in scanBamHeader which processes the entire header when all seqinfo
> needs is the `targets` portion of the list. Additionally, the
> order(rankSeqLevels(.)) doesn't scale either. So I've replaced this as
> well. I've changed the body of seqinfo-BamFile from:
>
>      h <- scanBamHeader(x)[["targets"]]
>      o <- order(rankSeqlevels(names(h)))
>      Seqinfo(names(h)[o], unname(h)[o])
>
> to this:
>
>      if (!isOpen(x)) {
>          open(x)
>          on.exit(close(x))
>      }
>      h <- .Call(.read_bamfile_header, .extptr(x))$targets
>      Seqinfo(names(h), unname(h))
>
> We then get:
>
>> scanBam("sim.fasta-L27-ma8-mp6-rfg5_2-rdg3_1.bam")
>     user  system elapsed
>   14.780   0.360  15.158
>
> Which is still pretty slow for how small these files are, but I can
> probably live with that.
>
> Two questions:
> -- do we need the: order(rankSeqlevels(names(h))) bit? that does change the
> return value, but I can certainly live with the ordering in the file.
> -- what else am I missing?
>
> I can send a patch if need be, but would like to hear from the cognoscenti
> first if there is a "built-in" way to avoid this overhead.
>
> thanks, jim
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list