[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Wed Apr 10 18:59:53 CEST 2013

I think he wants both the counts from using a 'mode' and counts for the 
intra-feature reads. Using countOverlaps(..., type="within") would get 
the counts for the intra-feature reads only and then you'd need to add 
them to the counts from using 'mode'.

The interface for using findOverlaps/countOverlaps from 
summarizeOverlaps() alreay exists but may not be user-friendly or 
intuitive enough. For example, if you wanted to count with 
countOverlaps(..., type="within"), where you are interested in the reads 
falling "within" the features, the counter would be something like,

counter <- function(x, y,  ignore.strand)
{
     counts <- subjectHits(findOverlaps(y, x, type="within",
                                        ignore.strand=ignore.strand))
     structure(tabulate(counts, NROW(x)), names=names(x))
}

Too cryptic?

Valerie

On 04/10/2013 09:16 AM, Michael Lawrence wrote:
> It sounds like what Thomas wants should be achievable with
> countOverlaps. For example, IntersectionStrict (if treating all features
> independently) would equate to type="within". So we just need a
> summarizeOverlaps-style interface to those simple utilities. I agree
> this would be very useful.
>
> Michael
>
>
> On Wed, Apr 10, 2013 at 8:50 AM, Valerie Obenchain <vobencha at fhcrc.org
> <mailto:vobencha at fhcrc.org>> wrote:
>
>     Ah, I see. You'd like to count with one of the existing modes but
>     have the option to pick up counts for these inter-feature reads
>     (fall completely 'within' >1 feature). These inter-feature reads
>     would be double (triple, quadruple, etc.) counted. Essentially they
>     would add one count to each feature they hit. Right?
>
>     One more thought about memory usage. If you are working with
>     single-end reads the summarizeOverlaps,BamFileList-__method I
>     mentioned below should work fine. The approach is slightly different
>     with paired-end reads. Our current algorithm for pairing paired-end
>     reads requires the whole file to be read into memory. A different
>     approach is currently being developed but in the meantime you can
>     take the qname-sorted approach. The Bam file will need to be sorted
>     by qname and both the 'yieldSize' and 'obeyQname=TRUE' set when
>     creating the BamFile/BamFileList. An example is on ?BamFile.
>
>     Valerie
>
>
>
>     On 04/09/2013 08:01 PM, Thomas Girke wrote:
>
>         Thanks for the tip. I guess doing it this way reverses the
>         counting mode
>         back to countOverlaps, but how can I use at the same time
>         "IntersectionStrict" or any of the other modes provided by
>         summarizeOverlaps if its mode argument is already used and
>         countOverlaps
>         doesn't accept one?
>
>         Thomas
>
>         On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
>
>             Thanks for the example. You're right, none of the modes will
>             count a
>             read falling completely within >1 feature.
>
>             You can supply your own function as a 'mode'. Two
>             requirements must be met:
>
>             (1) The function must have 3 arguments that correspond to
>                    'features', 'reads' and 'ignore.strand', in that order.
>
>             (2) The function must return a vector of counts the same length
>                    as 'features'
>
>             Here is an example using countOverlaps() which I think gets
>             at the
>             counting you want.
>
>             counter <- function(x, y,  ignore.strand)
>                    countOverlaps(y, x, ignore.strand=ignore.strand)
>
>                > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
>                     [,1]
>             [1,]    1
>             [2,]    1
>
>
>             Valerie
>
>             On 04/09/2013 09:37 AM, Thomas Girke wrote:
>
>                 Hi Valerie,
>
>                 Perhaps let's call this more explicitly an
>                 ignore_inter_feature_overlap option
>                 that we often need for cases like this:
>
>                 Read1    ----
>                 F1 ----------------
>                 F2   -----------
>
>                 where we would like to have at least the option to
>                 assign Read1 to both F1 and F2:
>
>                 F1: 1
>                 F2: 1
>
>                 However, summarizeOverlapse doesn't count Read1 at all
>                 in all of its currently
>                 available modes that I am aware of. This lack of an
>                 ignore_inter_feature_overlap
>                 option is frequently a problem when working with poorly
>                 annotated genomes (high
>                 uncertainty about hidden/incorrect feature overlaps) or
>                 various
>                 RNA/ChIP-Seq _engineering_ projects where I rather take
>                 the risk of ambiguous read
>                 assignments than not counting at all as shown above.
>
>                 ## Here is a code example illustrating the same case:
>                 library(GenomicRanges); library(Rsamtools)
>                 rd <- GappedAlignments(letters[1], seqnames =
>                 Rle(rep("chr1",1)),
>                          pos = as.integer(c(500)),
>                          cigar = rep("100M", 1), strand =
>                 strand(rep("+", 1)))
>                 gr1 <- GRanges("chr1", IRanges(start=100, width=1001),
>                 strand="+", ID="feat1")
>                 gr2 <- GRanges("chr1", IRanges(start=500, width=101),
>                 strand="+", ID="feat2")
>                 gr <- c(gr1, gr2)
>
>                 ## All of the three current modes in summarizeOverlaps
>                 return a count of zero
>                 ## because of its inter_feature_overlap awareness:
>                 assays(summarizeOverlaps(gr, rd, mode="Union",
>                 ignore.strand=TRUE))$counts
>                 assays(summarizeOverlaps(gr, rd,
>                 mode="IntersectionStrict", ignore.strand=TRUE))$counts
>                 assays(summarizeOverlaps(gr, rd,
>                 mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
>                         [,1]
>                 [1,]    0
>                 [2,]    0
>
>                 ## However, countOverlaps handles this correctly, but is
>                 not the best choice when
>                 ## counting multiple range features.
>                 countOverlaps(gr, rd)
>                 [1] 1 1
>
>
>                 Thomas
>
>                     sessionInfo()
>
>                 R version 3.0.0 (2013-04-03)
>                 Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
>                 locale:
>                 [1]
>                 en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8
>
>                 attached base packages:
>                 [1] parallel  stats     graphics  grDevices utils
>                 datasets  methods
>                 [8] base
>
>                 other attached packages:
>                 [1] Rsamtools_1.12.0     Biostrings_2.28.0
>                   GenomicRanges_1.12.1
>                 [4] IRanges_1.18.0       BiocGenerics_0.6.0
>
>                 loaded via a namespace (and not attached):
>                 [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0
>                   zlibbioc_1.6.0
>
>
>                 On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie
>                 Obenchain wrote:
>
>                     Hi Thomas,
>
>                     On 04/08/2013 05:52 PM, Thomas Girke wrote:
>
>                         Dear Valerie,
>
>                         Is there currently any way to run
>                         summarizeOverlaps in a feature-overlap
>                         unaware mode, e.g with an
>                         ignorefeatureOL=FALSE/TRUE setting? Currently,
>                         one can switch back to countOverlaps when
>                         feature overlap unawareness is
>                         the more appropriate counting mode for a
>                         biological question, but then
>                         double counting of reads mapping to
>                         multiple-range features is not
>                         accounted for. It would be really nice to have
>                         such a feature-overlap
>                         unaware option directly in summarizeOverlaps.
>
>
>                     No, we don't currently have an option to ignore
>                     feature-overlap. It
>                     sounds like you want to count with countOverlaps()
>                     but still want the
>                     double counting to be resolved. This is essentially
>                     what the other modes
>                     are doing so I must be missing something.
>
>                     In this example 2 reads hit feature A, 1 read hits
>                     feature B. With
>                     something like ignorefeature0L=TRUE, what results
>                     would you expect to
>                     see? Maybe you have another, more descriptive example?
>
>                     reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
>                     features <- GRanges("chr1", IRanges(c(1, 20), width=10,
>                                             names=c("A", "B")))
>
>                         > countOverlaps(features, reads)
>                     [1] 2 1
>
>
>
>                         Another question relates to the memory usage of
>                         summarizeOverlaps. Has
>                         this been optimized yet? On a typical bam file
>                         with ~50-100 million
>                         reads the memory usage of summarizeOverlaps is
>                         often around 10-20GB. To
>                         use the function on a desktop computer or in
>                         large-scale RNA-Seq
>                         projects on a commodity compute cluster, it
>                         would be desirable if every
>                         counting instance would consume not more than
>                         5GB of RAM.
>
>
>                     Have you tried the BamFileList-method? There is an
>                     example at the bottom
>                     of the ?BamFileList man page using
>                     summarizeOverlaps(). As Ryan
>                     mentioned, the key is to set the 'yieldSize'
>                     parameter when creating the
>                     BamFile. This method also makes use of mclapply().
>
>                     Valerie
>
>
>                         Thanks in advance for your help and suggestions,
>
>                         Thomas
>
>                         _________________________________________________
>                         Bioconductor mailing list
>                         Bioconductor at r-project.org
>                         <mailto:Bioconductor at r-project.org>
>                         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>                         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>                         Search the archives:
>                         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>                         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>     _________________________________________________
>     Bioconductor mailing list
>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>     https://stat.ethz.ch/mailman/__listinfo/bioconductor
>     <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>     Search the archives:
>     http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>