[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Tue Apr 9 18:37:26 CEST 2013

Hi Valerie,

Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
that we often need for cases like this:

Read1    ----
F1 ----------------
F2   -----------

where we would like to have at least the option to assign Read1 to both F1 and F2:

F1: 1
F2: 1

However, summarizeOverlapse doesn't count Read1 at all in all of its currently
available modes that I am aware of. This lack of an ignore_inter_feature_overlap 
option is frequently a problem when working with poorly annotated genomes (high 
uncertainty about hidden/incorrect feature overlaps) or various 
RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read 
assignments than not counting at all as shown above. 

## Here is a code example illustrating the same case:
library(GenomicRanges); library(Rsamtools)
rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
      pos = as.integer(c(500)),
      cigar = rep("100M", 1), strand = strand(rep("+", 1)))
gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
gr <- c(gr1, gr2)

## All of the three current modes in summarizeOverlaps return a count of zero 
## because of its inter_feature_overlap awareness:
assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
     [,1]
[1,]    0
[2,]    0

## However, countOverlaps handles this correctly, but is not the best choice when
## counting multiple range features.
countOverlaps(gr, rd)
[1] 1 1

Thomas

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Rsamtools_1.12.0     Biostrings_2.28.0    GenomicRanges_1.12.1
[4] IRanges_1.18.0       BiocGenerics_0.6.0

loaded via a namespace (and not attached):
[1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0

On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
> Hi Thomas,
> 
> On 04/08/2013 05:52 PM, Thomas Girke wrote:
> > Dear Valerie,
> >
> > Is there currently any way to run summarizeOverlaps in a feature-overlap
> > unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> > one can switch back to countOverlaps when feature overlap unawareness is
> > the more appropriate counting mode for a biological question, but then
> > double counting of reads mapping to multiple-range features is not
> > accounted for. It would be really nice to have such a feature-overlap
> > unaware option directly in summarizeOverlaps.
> 
> No, we don't currently have an option to ignore feature-overlap. It 
> sounds like you want to count with countOverlaps() but still want the 
> double counting to be resolved. This is essentially what the other modes 
> are doing so I must be missing something.
> 
> In this example 2 reads hit feature A, 1 read hits feature B. With 
> something like ignorefeature0L=TRUE, what results would you expect to 
> see? Maybe you have another, more descriptive example?
> 
> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
>                      names=c("A", "B")))
> 
>  > countOverlaps(features, reads)
> [1] 2 1
> 
> 
> >
> > Another question relates to the memory usage of summarizeOverlaps. Has
> > this been optimized yet? On a typical bam file with ~50-100 million
> > reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> > use the function on a desktop computer or in large-scale RNA-Seq
> > projects on a commodity compute cluster, it would be desirable if every
> > counting instance would consume not more than 5GB of RAM.
> 
> Have you tried the BamFileList-method? There is an example at the bottom 
> of the ?BamFileList man page using summarizeOverlaps(). As Ryan 
> mentioned, the key is to set the 'yieldSize' parameter when creating the 
> BamFile. This method also makes use of mclapply().
> 
> Valerie
> 
> >
> > Thanks in advance for your help and suggestions,
> >
> > Thomas
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >