[BioC] summarizeOverlaps mode ignoring inter feature overlaps
Thomas Girke
thomas.girke at ucr.edu
Tue Apr 9 18:37:26 CEST 2013
Hi Valerie,
Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
that we often need for cases like this:
Read1 ----
F1 ----------------
F2 -----------
where we would like to have at least the option to assign Read1 to both F1 and F2:
F1: 1
F2: 1
However, summarizeOverlapse doesn't count Read1 at all in all of its currently
available modes that I am aware of. This lack of an ignore_inter_feature_overlap
option is frequently a problem when working with poorly annotated genomes (high
uncertainty about hidden/incorrect feature overlaps) or various
RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
assignments than not counting at all as shown above.
## Here is a code example illustrating the same case:
library(GenomicRanges); library(Rsamtools)
rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
pos = as.integer(c(500)),
cigar = rep("100M", 1), strand = strand(rep("+", 1)))
gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
gr <- c(gr1, gr2)
## All of the three current modes in summarizeOverlaps return a count of zero
## because of its inter_feature_overlap awareness:
assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
[,1]
[1,] 0
[2,] 0
## However, countOverlaps handles this correctly, but is not the best choice when
## counting multiple range features.
countOverlaps(gr, rd)
[1] 1 1
Thomas
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1
[4] IRanges_1.18.0 BiocGenerics_0.6.0
loaded via a namespace (and not attached):
[1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0
On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
> Hi Thomas,
>
> On 04/08/2013 05:52 PM, Thomas Girke wrote:
> > Dear Valerie,
> >
> > Is there currently any way to run summarizeOverlaps in a feature-overlap
> > unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> > one can switch back to countOverlaps when feature overlap unawareness is
> > the more appropriate counting mode for a biological question, but then
> > double counting of reads mapping to multiple-range features is not
> > accounted for. It would be really nice to have such a feature-overlap
> > unaware option directly in summarizeOverlaps.
>
> No, we don't currently have an option to ignore feature-overlap. It
> sounds like you want to count with countOverlaps() but still want the
> double counting to be resolved. This is essentially what the other modes
> are doing so I must be missing something.
>
> In this example 2 reads hit feature A, 1 read hits feature B. With
> something like ignorefeature0L=TRUE, what results would you expect to
> see? Maybe you have another, more descriptive example?
>
> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> names=c("A", "B")))
>
> > countOverlaps(features, reads)
> [1] 2 1
>
>
> >
> > Another question relates to the memory usage of summarizeOverlaps. Has
> > this been optimized yet? On a typical bam file with ~50-100 million
> > reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> > use the function on a desktop computer or in large-scale RNA-Seq
> > projects on a commodity compute cluster, it would be desirable if every
> > counting instance would consume not more than 5GB of RAM.
>
> Have you tried the BamFileList-method? There is an example at the bottom
> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
> mentioned, the key is to set the 'yieldSize' parameter when creating the
> BamFile. This method also makes use of mclapply().
>
> Valerie
>
> >
> > Thanks in advance for your help and suggestions,
> >
> > Thomas
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
More information about the Bioconductor
mailing list