[BioC] summarizeOverlaps mode ignoring inter feature overlaps
Thomas Girke
thomas.girke at ucr.edu
Wed Apr 10 22:43:08 CEST 2013
Yes, when trying to switch from countOverlaps to summarizeOverlaps I was
hoping the latter would provide all the functionalities of the former
plus some of the new features. Achieving the desirable result with a
custom function would be fine for me or other more advanced R users, but
I don't think it is a substitute for predefined and ~FDA approved~
counting modes where we want to have assurance that scientist A can
communicate to scientist B what counter s/he used and both get the same
result while not requiring any expert knowledge in R. I feel this is
particularly important for read counting since it is so fundamental to
many application areas in NGS sequence analysis.
Thomas
On Wed, Apr 10, 2013 at 04:59:53PM +0000, Valerie Obenchain wrote:
> I think he wants both the counts from using a 'mode' and counts for the
> intra-feature reads. Using countOverlaps(..., type="within") would get
> the counts for the intra-feature reads only and then you'd need to add
> them to the counts from using 'mode'.
>
> The interface for using findOverlaps/countOverlaps from
> summarizeOverlaps() alreay exists but may not be user-friendly or
> intuitive enough. For example, if you wanted to count with
> countOverlaps(..., type="within"), where you are interested in the reads
> falling "within" the features, the counter would be something like,
>
> counter <- function(x, y, ignore.strand)
> {
> counts <- subjectHits(findOverlaps(y, x, type="within",
> ignore.strand=ignore.strand))
> structure(tabulate(counts, NROW(x)), names=names(x))
> }
>
>
> Too cryptic?
>
> Valerie
>
>
> On 04/10/2013 09:16 AM, Michael Lawrence wrote:
> > It sounds like what Thomas wants should be achievable with
> > countOverlaps. For example, IntersectionStrict (if treating all features
> > independently) would equate to type="within". So we just need a
> > summarizeOverlaps-style interface to those simple utilities. I agree
> > this would be very useful.
> >
> > Michael
> >
> >
> > On Wed, Apr 10, 2013 at 8:50 AM, Valerie Obenchain <vobencha at fhcrc.org
> > <mailto:vobencha at fhcrc.org>> wrote:
> >
> > Ah, I see. You'd like to count with one of the existing modes but
> > have the option to pick up counts for these inter-feature reads
> > (fall completely 'within' >1 feature). These inter-feature reads
> > would be double (triple, quadruple, etc.) counted. Essentially they
> > would add one count to each feature they hit. Right?
> >
> > One more thought about memory usage. If you are working with
> > single-end reads the summarizeOverlaps,BamFileList-__method I
> > mentioned below should work fine. The approach is slightly different
> > with paired-end reads. Our current algorithm for pairing paired-end
> > reads requires the whole file to be read into memory. A different
> > approach is currently being developed but in the meantime you can
> > take the qname-sorted approach. The Bam file will need to be sorted
> > by qname and both the 'yieldSize' and 'obeyQname=TRUE' set when
> > creating the BamFile/BamFileList. An example is on ?BamFile.
> >
> > Valerie
> >
> >
> >
> > On 04/09/2013 08:01 PM, Thomas Girke wrote:
> >
> > Thanks for the tip. I guess doing it this way reverses the
> > counting mode
> > back to countOverlaps, but how can I use at the same time
> > "IntersectionStrict" or any of the other modes provided by
> > summarizeOverlaps if its mode argument is already used and
> > countOverlaps
> > doesn't accept one?
> >
> > Thomas
> >
> > On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
> >
> > Thanks for the example. You're right, none of the modes will
> > count a
> > read falling completely within >1 feature.
> >
> > You can supply your own function as a 'mode'. Two
> > requirements must be met:
> >
> > (1) The function must have 3 arguments that correspond to
> > 'features', 'reads' and 'ignore.strand', in that order.
> >
> > (2) The function must return a vector of counts the same length
> > as 'features'
> >
> > Here is an example using countOverlaps() which I think gets
> > at the
> > counting you want.
> >
> > counter <- function(x, y, ignore.strand)
> > countOverlaps(y, x, ignore.strand=ignore.strand)
> >
> > > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
> > [,1]
> > [1,] 1
> > [2,] 1
> >
> >
> > Valerie
> >
> > On 04/09/2013 09:37 AM, Thomas Girke wrote:
> >
> > Hi Valerie,
> >
> > Perhaps let's call this more explicitly an
> > ignore_inter_feature_overlap option
> > that we often need for cases like this:
> >
> > Read1 ----
> > F1 ----------------
> > F2 -----------
> >
> > where we would like to have at least the option to
> > assign Read1 to both F1 and F2:
> >
> > F1: 1
> > F2: 1
> >
> > However, summarizeOverlapse doesn't count Read1 at all
> > in all of its currently
> > available modes that I am aware of. This lack of an
> > ignore_inter_feature_overlap
> > option is frequently a problem when working with poorly
> > annotated genomes (high
> > uncertainty about hidden/incorrect feature overlaps) or
> > various
> > RNA/ChIP-Seq _engineering_ projects where I rather take
> > the risk of ambiguous read
> > assignments than not counting at all as shown above.
> >
> > ## Here is a code example illustrating the same case:
> > library(GenomicRanges); library(Rsamtools)
> > rd <- GappedAlignments(letters[1], seqnames =
> > Rle(rep("chr1",1)),
> > pos = as.integer(c(500)),
> > cigar = rep("100M", 1), strand =
> > strand(rep("+", 1)))
> > gr1 <- GRanges("chr1", IRanges(start=100, width=1001),
> > strand="+", ID="feat1")
> > gr2 <- GRanges("chr1", IRanges(start=500, width=101),
> > strand="+", ID="feat2")
> > gr <- c(gr1, gr2)
> >
> > ## All of the three current modes in summarizeOverlaps
> > return a count of zero
> > ## because of its inter_feature_overlap awareness:
> > assays(summarizeOverlaps(gr, rd, mode="Union",
> > ignore.strand=TRUE))$counts
> > assays(summarizeOverlaps(gr, rd,
> > mode="IntersectionStrict", ignore.strand=TRUE))$counts
> > assays(summarizeOverlaps(gr, rd,
> > mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
> > [,1]
> > [1,] 0
> > [2,] 0
> >
> > ## However, countOverlaps handles this correctly, but is
> > not the best choice when
> > ## counting multiple range features.
> > countOverlaps(gr, rd)
> > [1] 1 1
> >
> >
> > Thomas
> >
> > sessionInfo()
> >
> > R version 3.0.0 (2013-04-03)
> > Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >
> > locale:
> > [1]
> > en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8
> >
> > attached base packages:
> > [1] parallel stats graphics grDevices utils
> > datasets methods
> > [8] base
> >
> > other attached packages:
> > [1] Rsamtools_1.12.0 Biostrings_2.28.0
> > GenomicRanges_1.12.1
> > [4] IRanges_1.18.0 BiocGenerics_0.6.0
> >
> > loaded via a namespace (and not attached):
> > [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0
> > zlibbioc_1.6.0
> >
> >
> > On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie
> > Obenchain wrote:
> >
> > Hi Thomas,
> >
> > On 04/08/2013 05:52 PM, Thomas Girke wrote:
> >
> > Dear Valerie,
> >
> > Is there currently any way to run
> > summarizeOverlaps in a feature-overlap
> > unaware mode, e.g with an
> > ignorefeatureOL=FALSE/TRUE setting? Currently,
> > one can switch back to countOverlaps when
> > feature overlap unawareness is
> > the more appropriate counting mode for a
> > biological question, but then
> > double counting of reads mapping to
> > multiple-range features is not
> > accounted for. It would be really nice to have
> > such a feature-overlap
> > unaware option directly in summarizeOverlaps.
> >
> >
> > No, we don't currently have an option to ignore
> > feature-overlap. It
> > sounds like you want to count with countOverlaps()
> > but still want the
> > double counting to be resolved. This is essentially
> > what the other modes
> > are doing so I must be missing something.
> >
> > In this example 2 reads hit feature A, 1 read hits
> > feature B. With
> > something like ignorefeature0L=TRUE, what results
> > would you expect to
> > see? Maybe you have another, more descriptive example?
> >
> > reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
> > features <- GRanges("chr1", IRanges(c(1, 20), width=10,
> > names=c("A", "B")))
> >
> > > countOverlaps(features, reads)
> > [1] 2 1
> >
> >
> >
> > Another question relates to the memory usage of
> > summarizeOverlaps. Has
> > this been optimized yet? On a typical bam file
> > with ~50-100 million
> > reads the memory usage of summarizeOverlaps is
> > often around 10-20GB. To
> > use the function on a desktop computer or in
> > large-scale RNA-Seq
> > projects on a commodity compute cluster, it
> > would be desirable if every
> > counting instance would consume not more than
> > 5GB of RAM.
> >
> >
> > Have you tried the BamFileList-method? There is an
> > example at the bottom
> > of the ?BamFileList man page using
> > summarizeOverlaps(). As Ryan
> > mentioned, the key is to set the 'yieldSize'
> > parameter when creating the
> > BamFile. This method also makes use of mclapply().
> >
> > Valerie
> >
> >
> > Thanks in advance for your help and suggestions,
> >
> > Thomas
> >
> > _________________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > <mailto:Bioconductor at r-project.org>
> > https://stat.ethz.ch/mailman/__listinfo/bioconductor
> > <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> > Search the archives:
> > http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> > <http://news.gmane.org/gmane.science.biology.informatics.conductor>
> >
> >
> > _________________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> > https://stat.ethz.ch/mailman/__listinfo/bioconductor
> > <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> > Search the archives:
> > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
> >
> >
More information about the Bioconductor
mailing list