[BioC] summarizeOverlaps mode ignoring inter feature overlaps
Thomas Girke
thomas.girke at ucr.edu
Wed Apr 10 18:23:31 CEST 2013
Valerie,
Please see my inserted comments below.
On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote:
> Ah, I see. You'd like to count with one of the existing modes but have
> the option to pick up counts for these inter-feature reads (fall
> completely 'within' >1 feature). These inter-feature reads would be
> double (triple, quadruple, etc.) counted. Essentially they would add one
> count to each feature they hit. Right?
Correct. Perhaps let's discuss this with a very common example of
transcript-level counting rather than counting on the unified (virtual)
exonic regions of genes. With the current description provided in the
summarizeOverlaps vignette at
http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf
I don't see how this can be achieved without falling back to using
countOverlaps without loosing the new counting modes provided by
summarizeOverlaps?
>
> One more thought about memory usage. If you are working with single-end
> reads the summarizeOverlaps,BamFileList-method I mentioned below should
> work fine. The approach is slightly different with paired-end reads. Our
> current algorithm for pairing paired-end reads requires the whole file
> to be read into memory. A different approach is currently being
> developed but in the meantime you can take the qname-sorted approach.
> The Bam file will need to be sorted by qname and both the 'yieldSize'
> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An
> example is on ?BamFile.
Thanks for pointing this out. My fault that I didn't read through all
the linked documentation. Perhaps it may not be a bad idea to make the
memory restricted bam read instances the default setting in the future.
This will definitely help biologists using those utilities without
crashing their machines on the first attempt.
Thomas
>
> Valerie
>
>
> On 04/09/2013 08:01 PM, Thomas Girke wrote:
> > Thanks for the tip. I guess doing it this way reverses the counting mode
> > back to countOverlaps, but how can I use at the same time
> > "IntersectionStrict" or any of the other modes provided by
> > summarizeOverlaps if its mode argument is already used and countOverlaps
> > doesn't accept one?
> >
> > Thomas
> >
> > On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
> >> Thanks for the example. You're right, none of the modes will count a
> >> read falling completely within >1 feature.
> >>
> >> You can supply your own function as a 'mode'. Two requirements must be met:
> >>
> >> (1) The function must have 3 arguments that correspond to
> >> 'features', 'reads' and 'ignore.strand', in that order.
> >>
> >> (2) The function must return a vector of counts the same length
> >> as 'features'
> >>
> >> Here is an example using countOverlaps() which I think gets at the
> >> counting you want.
> >>
> >> counter <- function(x, y, ignore.strand)
> >> countOverlaps(y, x, ignore.strand=ignore.strand)
> >>
> >> > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
> >> [,1]
> >> [1,] 1
> >> [2,] 1
> >>
> >>
> >> Valerie
> >>
> >> On 04/09/2013 09:37 AM, Thomas Girke wrote:
> >>> Hi Valerie,
> >>>
> >>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
> >>> that we often need for cases like this:
> >>>
> >>> Read1 ----
> >>> F1 ----------------
> >>> F2 -----------
> >>>
> >>> where we would like to have at least the option to assign Read1 to both F1 and F2:
> >>>
> >>> F1: 1
> >>> F2: 1
> >>>
> >>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently
> >>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap
> >>> option is frequently a problem when working with poorly annotated genomes (high
> >>> uncertainty about hidden/incorrect feature overlaps) or various
> >>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
> >>> assignments than not counting at all as shown above.
> >>>
> >>> ## Here is a code example illustrating the same case:
> >>> library(GenomicRanges); library(Rsamtools)
> >>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
> >>> pos = as.integer(c(500)),
> >>> cigar = rep("100M", 1), strand = strand(rep("+", 1)))
> >>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
> >>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
> >>> gr <- c(gr1, gr2)
> >>>
> >>> ## All of the three current modes in summarizeOverlaps return a count of zero
> >>> ## because of its inter_feature_overlap awareness:
> >>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
> >>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
> >>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> >>> [,1]
> >>> [1,] 0
> >>> [2,] 0
> >>>
> >>> ## However, countOverlaps handles this correctly, but is not the best choice when
> >>> ## counting multiple range features.
> >>> countOverlaps(gr, rd)
> >>> [1] 1 1
> >>>
> >>>
> >>> Thomas
> >>>
> >>>> sessionInfo()
> >>> R version 3.0.0 (2013-04-03)
> >>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >>>
> >>> locale:
> >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >>>
> >>> attached base packages:
> >>> [1] parallel stats graphics grDevices utils datasets methods
> >>> [8] base
> >>>
> >>> other attached packages:
> >>> [1] Rsamtools_1.12.0 Biostrings_2.28.0 GenomicRanges_1.12.1
> >>> [4] IRanges_1.18.0 BiocGenerics_0.6.0
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0
> >>>
> >>>
> >>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
> >>>> Hi Thomas,
> >>>>
> >>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
> >>>>> Dear Valerie,
> >>>>>
> >>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap
> >>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
> >>>>> one can switch back to countOverlaps when feature overlap unawareness is
> >>>>> the more appropriate counting mode for a biological question, but then
> >>>>> double counting of reads mapping to multiple-range features is not
> >>>>> accounted for. It would be really nice to have such a feature-overlap
> >>>>> unaware option directly in summarizeOverlaps.
> >>>>
> >>>> No, we don't currently have an option to ignore feature-overlap. It
> >>>> sounds like you want to count with countOverlaps() but still want the
> >>>> double counting to be resolved. This is essentially what the other modes
> >>>> are doing so I must be missing something.
> >>>>
> >>>> In this example 2 reads hit feature A, 1 read hits feature B. With
> >>>> something like ignorefeature0L=TRUE, what results would you expect to
> >>>> see? Maybe you have another, more descriptive example?
> >>>>
> >>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> >>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> >>>> names=c("A", "B")))
> >>>>
> >>>> > countOverlaps(features, reads)
> >>>> [1] 2 1
> >>>>
> >>>>
> >>>>>
> >>>>> Another question relates to the memory usage of summarizeOverlaps. Has
> >>>>> this been optimized yet? On a typical bam file with ~50-100 million
> >>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
> >>>>> use the function on a desktop computer or in large-scale RNA-Seq
> >>>>> projects on a commodity compute cluster, it would be desirable if every
> >>>>> counting instance would consume not more than 5GB of RAM.
> >>>>
> >>>> Have you tried the BamFileList-method? There is an example at the bottom
> >>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
> >>>> mentioned, the key is to set the 'yieldSize' parameter when creating the
> >>>> BamFile. This method also makes use of mclapply().
> >>>>
> >>>> Valerie
> >>>>
> >>>>>
> >>>>> Thanks in advance for your help and suggestions,
> >>>>>
> >>>>> Thomas
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioconductor mailing list
> >>>>> Bioconductor at r-project.org
> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>
More information about the Bioconductor
mailing list