[Bioc-devel] Rsamtools BAM > SAM?

Martin Morgan mtmorgan at fhcrc.org
Thu Dec 5 17:19:12 CET 2013


On 12/05/2013 06:37 AM, Taku Tokuyasu wrote:
> Martin, Michael:
>
> Thank you for the prompt replies!  Glad to know the functionality is available.
>   The suggestions on alternatives are also quite helpful.  Amongst other
> reasons, we wish to support htseq-count because it is part of a reference
> RNA-seq protocol (http://www.ncbi.nlm.nih.gov/pubmed/23975260), and because I
> haven't gotten around to comparing timing and functionality with
> GenomicAlignments.  This is probably in a different thread, but e.g. is it known
> that GenomicAlignments::summarizeOverlaps() produces identical results to
> htseq-count with equivalent parameters?  What about timing?

Perhaps Valerie can chime in with specifics.

It is very hard to say that two algorithms are the 'same', in general the 
functionality of ht-seq and summarizeOverlaps is comparable; both support 
multiple counting modes (including in the case of summarizeOverlaps 
user-specified modes) but it would not be completely surprising if there were 
cases where the behaviour was different.

The parathyroidSE vignette

   http://bioconductor.org/packages/release/data/experiment/html/parathyroidSE.html

details use of summarizeOverlaps for counting.

The Rsubread manuscript Table 1

   http://www.ncbi.nlm.nih.gov/pubmed/24227677

contains a performance comparison, although the memory consumption of of 
summarizeOverlaps as reported in that paper appears to rely on reading an entire 
file in to memory (files are iterated in chunks when processed in the way 
described by the parathyroidSE vignette; this should have minimal cost for 
overall speed but provide more reasonable memory management). Rsubread is not 
available on Windows.

Martin

>
> _Taku
>
>
>
> On Thu, Dec 5, 2013 at 5:14 AM, Michael Lawrence <lawrence.michael at gene.com
> <mailto:lawrence.michael at gene.com>> wrote:
>
>     It would be appreciated to know where the current counting methods fall
>     short; i.e., why htseq-count is necessary.
>
>     Thanks,
>     Michael
>
>
>     On Thu, Dec 5, 2013 at 12:30 AM, Martin Morgan <mtmorgan at fhcrc.org
>     <mailto:mtmorgan at fhcrc.org>> wrote:
>
>         On 12/04/2013 11:17 PM, Taku Tokuyasu wrote:
>
>             Hello,
>
>             We are trying to support NGS pipelines with SAM input to
>             htseq-count, w/o
>             requiring a samtools install.  Has Rsamtools implemented / planned
>             any support
>             for BAM to SAM conversion?  I notice this has been requested before
>             (e.g. May
>             2012,
>             http://comments.gmane.org/__gmane.science.biology.__informatics.conductor/41344
>             <http://comments.gmane.org/gmane.science.biology.informatics.conductor/41344>),
>             so
>             I was wondering if there were any updates on this.
>
>
>         I added this to Rsamtools 1.15.14; this requires other package
>         dependencies to be current. I had delayed adding this because it seems a
>         step backward, to a relatively inefficient representation.
>
>         Other counting alternatives not requiring coercion to SAM files are
>         GenomicAlignments::__summarizeOverlaps (acting on BamFile / BamViews /
>         GAlignments / GAlignmentPairs, counting bins as GRanges / GRanges list
>         from TxDb or standard GTF etc files via rtracklayer::import) and
>         Rsubread::featureCount (SAM or BAM files, built-in or GTF or SAF
>         annotation files).
>
>         Please let me know if there are issues with the implementation.
>
>         Martin
>
>
>             Regards,
>
>             _Taku
>
>             Taku Tokuyasu
>             Computational Biology Core
>             UCSF Helen Diller Family Comprehensive Cancer Center
>
>
>
>
>         --
>         Computational Biology / Fred Hutchinson Cancer Research Center
>         1100 Fairview Ave. N.
>         PO Box 19024 Seattle, WA 98109
>
>         Location: Arnold Building M1 B861
>         Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>
>         _________________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
>         https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>         <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list