[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Martin Morgan mtmorgan at fhcrc.org
Thu Apr 11 01:37:04 CEST 2013


On 04/10/2013 12:42 PM, Thomas Girke wrote:
> Thanks. Adding an inter-feature unaware mode will be very helpful and also
> broaden summarizeOverlaps' application spectrum for a lot of use cases.

I'm probably being quite dense here, and am mostly an outside observer. What I 
hear you saying is that there are currently three modes -- Union, 
IntersectionStrict, IntersectionNonEmpty. These modes are summarized in the 
seven rows of figure 1 of

    vignette("summarizeOverlaps", package="GenomicRanges")

Let's say there is a flag inter_feature, and when its value is TRUE then the 
current counting schemes are obtained. These modes differ in the way counting 
works for reads illustrated in rows 5, 6, and 7 of the figure. You'd like a 
count scored where a '1' appears in the table below. With inter_feature=TRUE 
reads are 'never counted more than once' ('Hits per read' is <= 1)

|----------------------+-----+-----------+-----------+---------------|
| Mode                 | Row | Feature 1 | Feature 2 | Hits per read |
|----------------------+-----+-----------+-----------+---------------|
| Union                |   5 | 1         | 0         | 1             |
|                      |   6 | 0         | 0         | 0             |
|                      |   7 | 0         | 0         | 0             |
| IntersectionStrict   |   5 | 1         | 0         | 1             |
|                      |   6 | 1         | 0         | 1             |
|                      |   7 | 0         | 0         | 0             |
| IntersectionNotEmpty |   5 | 1         | 0         | 1             |
|                      |   6 | 1         | 0         | 1             |
|                      |   7 | 0         | 0         | 0             |
|----------------------+-----+-----------+-----------+---------------|


You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes 
'counted twice'

|----------------------+-----+-----------+-----------+---------------|
| Mode                 | Row | Feature 1 | Feature 2 | Hits per read |
|----------------------+-----+-----------+-----------+---------------|
| Union                |   5 |         1 |         0 |             1 |
|                      |   6 |         1 |         1 |             2 |
|                      |   7 |         1 |         1 |             2 |
| IntersectionStrict   |   5 |         1 |         0 |             1 |
|                      |   6 |         1 |         0 |             1 |
|                      |   7 |         1 |         1 |             2 |
| IntersectionNotEmpty |   5 |         1 |         0 |             1 |
|                      |   6 |         1 |         1 |             2 |
|                      |   7 |         1 |         1 |             2 |
|----------------------+-----+-----------+-----------+---------------|


Martin

>
> Thomas
>
>
> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote:
>> Hi Thomas,
>>
>> On 04/10/2013 09:23 AM, Thomas Girke wrote:
>>> Valerie,
>>>
>>> Please see my inserted comments below.
>>>
>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote:
>>>> Ah, I see. You'd like to count with one of the existing modes but have
>>>> the option to pick up counts for these inter-feature reads (fall
>>>> completely 'within' >1 feature). These inter-feature reads would be
>>>> double (triple, quadruple, etc.) counted. Essentially they would add one
>>>> count to each feature they hit. Right?
>>>
>>> Correct. Perhaps let's discuss this with a very common example of
>>> transcript-level counting rather than counting on the unified (virtual)
>>> exonic regions of genes. With the current description provided in the
>>> summarizeOverlaps vignette at
>>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf
>>> I don't see how this can be achieved without falling back to using
>>> countOverlaps without loosing the new counting modes provided by
>>> summarizeOverlaps?
>>
>> It can't be achieved with the function as is but we could add an
>> argument to handle this (as you suggested from the start). If
>> 'inter-feature=TRUE' then these counts would be added to the counts
>> already obtained from using a particular 'mode'. I will move ahead with
>> implementing this argument.
>>
>>>
>>>>
>>>> One more thought about memory usage. If you are working with single-end
>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below should
>>>> work fine. The approach is slightly different with paired-end reads. Our
>>>> current algorithm for pairing paired-end reads requires the whole file
>>>> to be read into memory. A different approach is currently being
>>>> developed but in the meantime you can take the qname-sorted approach.
>>>> The Bam file will need to be sorted by qname and both the 'yieldSize'
>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An
>>>> example is on ?BamFile.
>>>
>>> Thanks for pointing this out. My fault that I didn't read through all
>>> the linked documentation. Perhaps it may not be a bad idea to make the
>>> memory restricted bam read instances the default setting in the future.
>>> This will definitely help biologists using those utilities without
>>> crashing their machines on the first attempt.
>>
>> Good suggestion, thanks.
>>
>> Valerie
>>
>>>
>>> Thomas
>>>
>>>
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote:
>>>>> Thanks for the tip. I guess doing it this way reverses the counting mode
>>>>> back to countOverlaps, but how can I use at the same time
>>>>> "IntersectionStrict" or any of the other modes provided by
>>>>> summarizeOverlaps if its mode argument is already used and countOverlaps
>>>>> doesn't accept one?
>>>>>
>>>>> Thomas
>>>>>
>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
>>>>>> Thanks for the example. You're right, none of the modes will count a
>>>>>> read falling completely within >1 feature.
>>>>>>
>>>>>> You can supply your own function as a 'mode'. Two requirements must be met:
>>>>>>
>>>>>> (1) The function must have 3 arguments that correspond to
>>>>>>         'features', 'reads' and 'ignore.strand', in that order.
>>>>>>
>>>>>> (2) The function must return a vector of counts the same length
>>>>>>         as 'features'
>>>>>>
>>>>>> Here is an example using countOverlaps() which I think gets at the
>>>>>> counting you want.
>>>>>>
>>>>>> counter <- function(x, y,  ignore.strand)
>>>>>>         countOverlaps(y, x, ignore.strand=ignore.strand)
>>>>>>
>>>>>>     > assays(summarizeOverlaps(gr, rd, mode=counter))$counts
>>>>>>          [,1]
>>>>>> [1,]    1
>>>>>> [2,]    1
>>>>>>
>>>>>>
>>>>>> Valerie
>>>>>>
>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote:
>>>>>>> Hi Valerie,
>>>>>>>
>>>>>>> Perhaps let's call this more explicitly an ignore_inter_feature_overlap option
>>>>>>> that we often need for cases like this:
>>>>>>>
>>>>>>> Read1    ----
>>>>>>> F1 ----------------
>>>>>>> F2   -----------
>>>>>>>
>>>>>>> where we would like to have at least the option to assign Read1 to both F1 and F2:
>>>>>>>
>>>>>>> F1: 1
>>>>>>> F2: 1
>>>>>>>
>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of its currently
>>>>>>> available modes that I am aware of. This lack of an ignore_inter_feature_overlap
>>>>>>> option is frequently a problem when working with poorly annotated genomes (high
>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various
>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk of ambiguous read
>>>>>>> assignments than not counting at all as shown above.
>>>>>>>
>>>>>>> ## Here is a code example illustrating the same case:
>>>>>>> library(GenomicRanges); library(Rsamtools)
>>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
>>>>>>>           pos = as.integer(c(500)),
>>>>>>>           cigar = rep("100M", 1), strand = strand(rep("+", 1)))
>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001), strand="+", ID="feat1")
>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="feat2")
>>>>>>> gr <- c(gr1, gr2)
>>>>>>>
>>>>>>> ## All of the three current modes in summarizeOverlaps return a count of zero
>>>>>>> ## because of its inter_feature_overlap awareness:
>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union", ignore.strand=TRUE))$counts
>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict", ignore.strand=TRUE))$counts
>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty", ignore.strand=TRUE))$counts
>>>>>>>          [,1]
>>>>>>> [1,]    0
>>>>>>> [2,]    0
>>>>>>>
>>>>>>> ## However, countOverlaps handles this correctly, but is not the best choice when
>>>>>>> ## counting multiple range features.
>>>>>>> countOverlaps(gr, rd)
>>>>>>> [1] 1 1
>>>>>>>
>>>>>>>
>>>>>>> Thomas
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>>>>> [8] base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] Rsamtools_1.12.0     Biostrings_2.28.0    GenomicRanges_1.12.1
>>>>>>> [4] IRanges_1.18.0       BiocGenerics_0.6.0
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain wrote:
>>>>>>>> Hi Thomas,
>>>>>>>>
>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
>>>>>>>>> Dear Valerie,
>>>>>>>>>
>>>>>>>>> Is there currently any way to run summarizeOverlaps in a feature-overlap
>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting? Currently,
>>>>>>>>> one can switch back to countOverlaps when feature overlap unawareness is
>>>>>>>>> the more appropriate counting mode for a biological question, but then
>>>>>>>>> double counting of reads mapping to multiple-range features is not
>>>>>>>>> accounted for. It would be really nice to have such a feature-overlap
>>>>>>>>> unaware option directly in summarizeOverlaps.
>>>>>>>>
>>>>>>>> No, we don't currently have an option to ignore feature-overlap. It
>>>>>>>> sounds like you want to count with countOverlaps() but still want the
>>>>>>>> double counting to be resolved. This is essentially what the other modes
>>>>>>>> are doing so I must be missing something.
>>>>>>>>
>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B. With
>>>>>>>> something like ignorefeature0L=TRUE, what results would you expect to
>>>>>>>> see? Maybe you have another, more descriptive example?
>>>>>>>>
>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
>>>>>>>>                          names=c("A", "B")))
>>>>>>>>
>>>>>>>>      > countOverlaps(features, reads)
>>>>>>>> [1] 2 1
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Another question relates to the memory usage of summarizeOverlaps. Has
>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100 million
>>>>>>>>> reads the memory usage of summarizeOverlaps is often around 10-20GB. To
>>>>>>>>> use the function on a desktop computer or in large-scale RNA-Seq
>>>>>>>>> projects on a commodity compute cluster, it would be desirable if every
>>>>>>>>> counting instance would consume not more than 5GB of RAM.
>>>>>>>>
>>>>>>>> Have you tried the BamFileList-method? There is an example at the bottom
>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when creating the
>>>>>>>> BamFile. This method also makes use of mclapply().
>>>>>>>>
>>>>>>>> Valerie
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks in advance for your help and suggestions,
>>>>>>>>>
>>>>>>>>> Thomas
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list