[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Tue May 14 23:21:44 CEST 2013

Hi Thomas,

Two new arguments have been added to summarizeOverlaps(): 'inter.feature' and 
'fragments'. Both are available in GenomicRanges 1.13.11 and Rsamtools 1.13.13. 
The ?summarizeOverlaps page in GenomicRanges now has all the examples (vs 
having half in GenomicRanges, half in Rsamtools).

'inter.feature':
When TRUE (default), counting is as it always was - reads that hit 
multiple features are resolved with one of the modes or dropped. When 
FALSE, each feature that a read hits gets a count. This essentially boils 
down to countOverlaps() with type="any" (Union and IntersectionNotEmpty) 
or type="within" (IntersectionStrict).
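
To make the difference concrete, here is a toy model of the two settings in 
plain Python. It is only a sketch of the counting rule described above, not 
Bioconductor code and not how summarizeOverlaps() is implemented. The names 
count_reads, overlaps and within, the closed-interval representation and the 
example coordinates are all made up for illustration; only Union and 
IntersectionStrict are modelled, and gapped reads are ignored.

```python
def overlaps(read, rng):
    """True if two closed intervals (start, end) share at least one base."""
    return read[0] <= rng[1] and rng[0] <= read[1]


def within(read, rng):
    """True if the read lies completely inside the range."""
    return rng[0] <= read[0] and read[1] <= rng[1]


def count_reads(features, reads, mode="Union", inter_feature=True):
    """Toy counter (hypothetical helper, not the Bioconductor API).

    features: dict mapping a feature name to a list of (start, end) ranges
    reads:    list of (start, end) intervals
    """
    # "any"-style test for Union, "within"-style test for IntersectionStrict.
    test = within if mode == "IntersectionStrict" else overlaps
    counts = {name: 0 for name in features}
    for read in reads:
        # A multi-range feature is hit at most once per read.
        hits = [name for name, ranges in features.items()
                if any(test(read, r) for r in ranges)]
        if inter_feature and len(hits) > 1:
            continue  # ambiguous read: dropped when features are compared
        for name in hits:
            counts[name] += 1
    return counts


# Two overlapping features; geneA has two ranges (e.g. two exons).
features = {
    "geneA": [(1, 100), (201, 300)],
    "geneB": [(80, 150)],
}
reads = [(10, 40), (85, 95), (90, 210), (120, 140)]

print(count_reads(features, reads))                       # {'geneA': 1, 'geneB': 1}
print(count_reads(features, reads, inter_feature=False))  # {'geneA': 3, 'geneB': 3}
```

In this model the read (90, 210) touches both ranges of geneA but adds only 
one count to it, which is the double counting issue raised for multi-range 
features in the original question.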

'fragments':
This argument is relevant to counting paired-end Bam files. It was added 
because of the flexibility the GAlignmentsList class offers. The 
familiar GAlignmentPairs class holds reads that have been "properly 
mated" with the algorithm in ?findMateAlignment. GAlignmentsList can 
hold these "properly mated" reads as well as the singletons, reads with 
unmapped pairs and any others in the Bam.

When TRUE (default), "properly mated" reads and all others are counted. You 
can of course still add your own filtering / QC with param = ScanBamParam(). 
When FALSE, only reads that have been "properly mated" will be counted.
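
As a rough illustration of what the switch changes, here is a toy model in 
plain Python. Again this is only a sketch under stated assumptions, not 
Bioconductor code: the Fragment class, the count_fragments helper, the 
properly_mated flag and the coordinates are all hypothetical, and the model 
assumes a fragment adds at most one count to a feature even when both mates 
overlap it.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One record from a paired-end file (hypothetical toy type)."""
    mates: list            # one or two (start, end) closed intervals
    properly_mated: bool   # True if the two mates were paired up


def count_fragments(feature, records, fragments=True):
    """Count records overlapping a single (start, end) feature.

    fragments=True  counts properly mated pairs and everything else;
    fragments=False counts properly mated pairs only.
    """
    n = 0
    for rec in records:
        if not fragments and not rec.properly_mated:
            continue  # singletons and other leftovers are skipped
        # The record counts once if any of its mates overlaps the feature.
        if any(s <= feature[1] and feature[0] <= e for s, e in rec.mates):
            n += 1
    return n


feature = (100, 500)
records = [
    Fragment([(120, 170), (300, 350)], True),   # proper pair, both mates inside
    Fragment([(450, 499), (620, 670)], True),   # proper pair, one mate overlaps
    Fragment([(200, 250)], False),              # singleton inside the feature
    Fragment([(700, 750)], False),              # singleton outside the feature
]

print(count_fragments(feature, records, fragments=True))   # 3
print(count_fragments(feature, records, fragments=False))  # 2
```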

Let me know how it goes.
Valerie

On 04/08/13 17:52, Thomas Girke wrote:
> Dear Valerie,
>
> Is there currently any way to run summarizeOverlaps in a feature-overlap
> unaware mode, e.g. with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> one can switch back to countOverlaps when feature overlap unawareness is
> the more appropriate counting mode for a biological question, but then
> double counting of reads mapping to multiple-range features is not
> accounted for. It would be really nice to have such a feature-overlap
> unaware option directly in summarizeOverlaps.
>
> Another question relates to the memory usage of summarizeOverlaps. Has
> this been optimized yet? On a typical bam file with ~50-100 million
> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> use the function on a desktop computer or in large-scale RNA-Seq
> projects on a commodity compute cluster, it would be desirable if every
> counting instance would consume not more than 5GB of RAM.
>
> Thanks in advance for your help and suggestions,
>
> Thomas