[BioC] summarizeOverlaps mode ignoring inter feature overlaps

Fri Apr 12 04:11:51 CEST 2013

Hi Michael,

I could not find the 'summarizeSpliceOverlaps' function in the GenomicFeatures package (1.12.0). 

The findSpliceOverlaps function seems to count reads more than once as well. I ran the sample code in the help page for this function and found that the single fragment was assigned to both transcripts:

> genes <- GRangesList(
+     GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"),
+     GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+"))
> genes
GRangesList of length 2:
[[1]] 
GRanges with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1  [ 5, 10]      +
  [2]     chr1  [20, 25]      +

[[2]] 
GRanges with 2 ranges and 0 metadata columns:
      seqnames   ranges strand
  [1]     chr1 [ 5, 15]      +
  [2]     chr1 [22, 25]      +

---
seqlengths:
 chr1
   NA
> galp <- GappedAlignmentPairs(
+     GappedAlignments("chr1", 5L, "11M4N6M", strand("+")),
+     GappedAlignments("chr1", 50L, "6M", strand("-")),
+     isProperPair=TRUE)
> galp
GappedAlignmentPairs with 1 alignment pair and 0 metadata columns:
    seqnames strand :    ranges --    ranges
       <Rle>  <Rle> : <IRanges> -- <IRanges>
[1]     chr1      + :   [5, 25] --  [50, 55]
---
seqlengths:
 chr1
   NA
> findSpliceOverlaps(galp, genes)
Hits of length 2
queryLength: 1
subjectLength: 2
  queryHits subjectHits compatible    unique    coding strandSpecific 
   <integer>   <integer>  <logical> <logical> <logical>      <logical> 
 1         1           1      FALSE     FALSE        NA           TRUE 
 2         1           2      FALSE     FALSE        NA           TRUE 

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] limma_3.16.1           GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0         GenomicRanges_1.12.1  
[6] IRanges_1.18.0         BiocGenerics_0.6.0     Rsubread_1.10.1        BiocInstaller_1.10.0  

loaded via a namespace (and not attached):
 [1] biomaRt_2.16.0     Biostrings_2.28.0  bitops_1.0-5       BSgenome_1.28.0    DBI_0.2-5          RCurl_1.95-4.1    
 [7] Rsamtools_1.12.0   RSQLite_0.11.2     rtracklayer_1.20.0 stats4_3.0.0       tools_3.0.0        XML_3.95-0.2      
[13] zlibbioc_1.6.0    
> 

Wei

On Apr 11, 2013, at 11:36 PM, Michael Lawrence wrote:

> On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke at ucr.edu> wrote:
> 
>> Hi Martin,
>> 
>> Yes, inter_feature=TRUE would maintain the current counting mode(s) that
>> prohibits counting of reads mapping to multiple features. This is a
>> special case of counting that is very useful for counting exonic regions
>> of genes. However, one also wants to be able to turn off this behavior
>> by ignoring inter feature overlaps (just like ignore.strand=T/F).
>> Otherwise we cannot use summarizeOverlaps along with its current modes
>> for important operations like transcript level counting because many
>> transcript variants from the same gene will mask each others reads when
>> inter_feature=TRUE. Providing the option to output the results from both
>> inter_feature=TRUE and inter_feature=FALSE could be a very sensible
>> solution and time saver for users working with new genomes/GFFs, where
>> one cannot trust every nested annotation for various reasons, and
>> inter_feature=TRUE can quickly become a very risky counting strategy
>> since it tends to erase counts. Every biologist will scream in your
>> face if the counter tells them that their favorite gene has zero counts
>> just because of some overlap with some annotation error:).
>> 
>> For multiple range features stored in GRangesList objects, I would
>> currently favor making "inter_feature=FALSE" ignore the overlaps
>> occurring among different list components, but not necessarily those
>> within a list component (e.g. exon ranges of a gene). This way one
>> can benefit from the current infrastructure by restricting its feature
>> overlap scope to the range sets stored within individual list components
>> but ignoring those among different list components. Utilities like reduce
>> and other range modifier functions could handle situations where one wants
>> to ignore all feature overlaps within and among list components. However,
>> I am sure there could be other solutions to this.
>> 
>> Long story short if inter_feature=TRUE/FALSE could be used in
>> combination with modes=Union/IntersectionStrict/IntersectionNonEmpty
>> resulting in six different counting flavors, I would be happy and, I am
>> sure, many other Bioc users as well.
>> 
>> 
> Have you taken a look at findSpliceOverlaps? Maybe summarizeSpliceOverlaps
> could be completed to satisfy your use case? Somehow we need to control the
> proliferation of counting functions and modes. Having some idea of your
> high-level use case might help.
> 
> Also, for spliced alignments, I recommend giving GSNAP a try if you haven't
> already. It's accessible though the gmapR package.
> 
> Michael
> 
> I hope this makes sense.
>> 
>> Thomas
>> 
>> 
>> 
>> On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote:
>>> On 04/10/2013 12:42 PM, Thomas Girke wrote:
>>>> Thanks. Adding an inter-feature unaware mode will be very helpful and
>> also
>>>> broaden summarizeOverlaps' application spectrum for a lot of use cases.
>>> 
>>> I'm probably being quite dense here, and am mostly an outside observer.
>> What I
>>> hear you saying is that there are currently three modes -- Union,
>>> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in
>> the
>>> seven rows of figure 1 of
>>> 
>>>    vignette("summarizeOverlaps", package="GenomicRanges")
>>> 
>>> Let's say there is a flag inter_feature, and when its value is TRUE then
>> the
>>> current counting schemes are obtained. These modes differ in the way
>> counting
>>> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd
>> like a
>>> count scored where a '1' appears in the table below. With
>> inter_feature=TRUE
>>> reads are 'never counted more than once' ('Hits per read' is <= 1)
>>> 
>>> |----------------------+-----+-----------+-----------+---------------|
>>> | Mode                 | Row | Feature 1 | Feature 2 | Hits per read |
>>> |----------------------+-----+-----------+-----------+---------------|
>>> | Union                |   5 | 1         | 0         | 1             |
>>> |                      |   6 | 0         | 0         | 0             |
>>> |                      |   7 | 0         | 0         | 0             |
>>> | IntersectionStrict   |   5 | 1         | 0         | 1             |
>>> |                      |   6 | 1         | 0         | 1             |
>>> |                      |   7 | 0         | 0         | 0             |
>>> | IntersectionNotEmpty |   5 | 1         | 0         | 1             |
>>> |                      |   6 | 1         | 0         | 1             |
>>> |                      |   7 | 0         | 0         | 0             |
>>> |----------------------+-----+-----------+-----------+---------------|
>>> 
>>> 
>>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are
>> sometimes
>>> 'counted twice'
>>> 
>>> |----------------------+-----+-----------+-----------+---------------|
>>> | Mode                 | Row | Feature 1 | Feature 2 | Hits per read |
>>> |----------------------+-----+-----------+-----------+---------------|
>>> | Union                |   5 |         1 |         0 |             1 |
>>> |                      |   6 |         1 |         1 |             2 |
>>> |                      |   7 |         1 |         1 |             2 |
>>> | IntersectionStrict   |   5 |         1 |         0 |             1 |
>>> |                      |   6 |         1 |         0 |             1 |
>>> |                      |   7 |         1 |         1 |             2 |
>>> | IntersectionNotEmpty |   5 |         1 |         0 |             1 |
>>> |                      |   6 |         1 |         1 |             2 |
>>> |                      |   7 |         1 |         1 |             2 |
>>> |----------------------+-----+-----------+-----------+---------------|
>>> 
>>> 
>>> Martin
>>> 
>>>> 
>>>> Thomas
>>>> 
>>>> 
>>>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote:
>>>>> Hi Thomas,
>>>>> 
>>>>> On 04/10/2013 09:23 AM, Thomas Girke wrote:
>>>>>> Valerie,
>>>>>> 
>>>>>> Please see my inserted comments below.
>>>>>> 
>>>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote:
>>>>>>> Ah, I see. You'd like to count with one of the existing modes but
>> have
>>>>>>> the option to pick up counts for these inter-feature reads (fall
>>>>>>> completely 'within' >1 feature). These inter-feature reads would be
>>>>>>> double (triple, quadruple, etc.) counted. Essentially they would
>> add one
>>>>>>> count to each feature they hit. Right?
>>>>>> 
>>>>>> Correct. Perhaps let's discuss this with a very common example of
>>>>>> transcript-level counting rather than counting on the unified
>> (virtual)
>>>>>> exonic regions of genes. With the current description provided in the
>>>>>> summarizeOverlaps vignette at
>>>>>> 
>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf
>>>>>> I don't see how this can be achieved without falling back to using
>>>>>> countOverlaps without loosing the new counting modes provided by
>>>>>> summarizeOverlaps?
>>>>> 
>>>>> It can't be achieved with the function as is but we could add an
>>>>> argument to handle this (as you suggested from the start). If
>>>>> 'inter-feature=TRUE' then these counts would be added to the counts
>>>>> already obtained from using a particular 'mode'. I will move ahead
>> with
>>>>> implementing this argument.
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> One more thought about memory usage. If you are working with
>> single-end
>>>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below
>> should
>>>>>>> work fine. The approach is slightly different with paired-end
>> reads. Our
>>>>>>> current algorithm for pairing paired-end reads requires the whole
>> file
>>>>>>> to be read into memory. A different approach is currently being
>>>>>>> developed but in the meantime you can take the qname-sorted
>> approach.
>>>>>>> The Bam file will need to be sorted by qname and both the
>> 'yieldSize'
>>>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An
>>>>>>> example is on ?BamFile.
>>>>>> 
>>>>>> Thanks for pointing this out. My fault that I didn't read through all
>>>>>> the linked documentation. Perhaps it may not be a bad idea to make
>> the
>>>>>> memory restricted bam read instances the default setting in the
>> future.
>>>>>> This will definitely help biologists using those utilities without
>>>>>> crashing their machines on the first attempt.
>>>>> 
>>>>> Good suggestion, thanks.
>>>>> 
>>>>> Valerie
>>>>> 
>>>>>> 
>>>>>> Thomas
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Valerie
>>>>>>> 
>>>>>>> 
>>>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote:
>>>>>>>> Thanks for the tip. I guess doing it this way reverses the
>> counting mode
>>>>>>>> back to countOverlaps, but how can I use at the same time
>>>>>>>> "IntersectionStrict" or any of the other modes provided by
>>>>>>>> summarizeOverlaps if its mode argument is already used and
>> countOverlaps
>>>>>>>> doesn't accept one?
>>>>>>>> 
>>>>>>>> Thomas
>>>>>>>> 
>>>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
>>>>>>>>> Thanks for the example. You're right, none of the modes will
>> count a
>>>>>>>>> read falling completely within >1 feature.
>>>>>>>>> 
>>>>>>>>> You can supply your own function as a 'mode'. Two requirements
>> must be met:
>>>>>>>>> 
>>>>>>>>> (1) The function must have 3 arguments that correspond to
>>>>>>>>>        'features', 'reads' and 'ignore.strand', in that order.
>>>>>>>>> 
>>>>>>>>> (2) The function must return a vector of counts the same length
>>>>>>>>>        as 'features'
>>>>>>>>> 
>>>>>>>>> Here is an example using countOverlaps() which I think gets at the
>>>>>>>>> counting you want.
>>>>>>>>> 
>>>>>>>>> counter <- function(x, y,  ignore.strand)
>>>>>>>>>        countOverlaps(y, x, ignore.strand=ignore.strand)
>>>>>>>>> 
>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts
>>>>>>>>>         [,1]
>>>>>>>>> [1,]    1
>>>>>>>>> [2,]    1
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Valerie
>>>>>>>>> 
>>>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote:
>>>>>>>>>> Hi Valerie,
>>>>>>>>>> 
>>>>>>>>>> Perhaps let's call this more explicitly an
>> ignore_inter_feature_overlap option
>>>>>>>>>> that we often need for cases like this:
>>>>>>>>>> 
>>>>>>>>>> Read1    ----
>>>>>>>>>> F1 ----------------
>>>>>>>>>> F2   -----------
>>>>>>>>>> 
>>>>>>>>>> where we would like to have at least the option to assign Read1
>> to both F1 and F2:
>>>>>>>>>> 
>>>>>>>>>> F1: 1
>>>>>>>>>> F2: 1
>>>>>>>>>> 
>>>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of
>> its currently
>>>>>>>>>> available modes that I am aware of. This lack of an
>> ignore_inter_feature_overlap
>>>>>>>>>> option is frequently a problem when working with poorly
>> annotated genomes (high
>>>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various
>>>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk
>> of ambiguous read
>>>>>>>>>> assignments than not counting at all as shown above.
>>>>>>>>>> 
>>>>>>>>>> ## Here is a code example illustrating the same case:
>>>>>>>>>> library(GenomicRanges); library(Rsamtools)
>>>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
>>>>>>>>>>          pos = as.integer(c(500)),
>>>>>>>>>>          cigar = rep("100M", 1), strand = strand(rep("+", 1)))
>>>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001),
>> strand="+", ID="feat1")
>>>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101),
>> strand="+", ID="feat2")
>>>>>>>>>> gr <- c(gr1, gr2)
>>>>>>>>>> 
>>>>>>>>>> ## All of the three current modes in summarizeOverlaps return a
>> count of zero
>>>>>>>>>> ## because of its inter_feature_overlap awareness:
>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union",
>> ignore.strand=TRUE))$counts
>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict",
>> ignore.strand=TRUE))$counts
>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty",
>> ignore.strand=TRUE))$counts
>>>>>>>>>>         [,1]
>>>>>>>>>> [1,]    0
>>>>>>>>>> [2,]    0
>>>>>>>>>> 
>>>>>>>>>> ## However, countOverlaps handles this correctly, but is not the
>> best choice when
>>>>>>>>>> ## counting multiple range features.
>>>>>>>>>> countOverlaps(gr, rd)
>>>>>>>>>> [1] 1 1
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thomas
>>>>>>>>>> 
>>>>>>>>>>> sessionInfo()
>>>>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>>>>> 
>>>>>>>>>> locale:
>>>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>>>> 
>>>>>>>>>> attached base packages:
>>>>>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>> methods
>>>>>>>>>> [8] base
>>>>>>>>>> 
>>>>>>>>>> other attached packages:
>>>>>>>>>> [1] Rsamtools_1.12.0     Biostrings_2.28.0
>> GenomicRanges_1.12.1
>>>>>>>>>> [4] IRanges_1.18.0       BiocGenerics_0.6.0
>>>>>>>>>> 
>>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>>> [1] bitops_1.0-5   stats4_3.0.0   tools_3.0.0    zlibbioc_1.6.0
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain
>> wrote:
>>>>>>>>>>> Hi Thomas,
>>>>>>>>>>> 
>>>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
>>>>>>>>>>>> Dear Valerie,
>>>>>>>>>>>> 
>>>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a
>> feature-overlap
>>>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting?
>> Currently,
>>>>>>>>>>>> one can switch back to countOverlaps when feature overlap
>> unawareness is
>>>>>>>>>>>> the more appropriate counting mode for a biological question,
>> but then
>>>>>>>>>>>> double counting of reads mapping to multiple-range features is
>> not
>>>>>>>>>>>> accounted for. It would be really nice to have such a
>> feature-overlap
>>>>>>>>>>>> unaware option directly in summarizeOverlaps.
>>>>>>>>>>> 
>>>>>>>>>>> No, we don't currently have an option to ignore
>> feature-overlap. It
>>>>>>>>>>> sounds like you want to count with countOverlaps() but still
>> want the
>>>>>>>>>>> double counting to be resolved. This is essentially what the
>> other modes
>>>>>>>>>>> are doing so I must be missing something.
>>>>>>>>>>> 
>>>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B.
>> With
>>>>>>>>>>> something like ignorefeature0L=TRUE, what results would you
>> expect to
>>>>>>>>>>> see? Maybe you have another, more descriptive example?
>>>>>>>>>>> 
>>>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
>>>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
>>>>>>>>>>>                         names=c("A", "B")))
>>>>>>>>>>> 
>>>>>>>>>>>> countOverlaps(features, reads)
>>>>>>>>>>> [1] 2 1
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Another question relates to the memory usage of
>> summarizeOverlaps. Has
>>>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100
>> million
>>>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around
>> 10-20GB. To
>>>>>>>>>>>> use the function on a desktop computer or in large-scale
>> RNA-Seq
>>>>>>>>>>>> projects on a commodity compute cluster, it would be desirable
>> if every
>>>>>>>>>>>> counting instance would consume not more than 5GB of RAM.
>>>>>>>>>>> 
>>>>>>>>>>> Have you tried the BamFileList-method? There is an example at
>> the bottom
>>>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
>>>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when
>> creating the
>>>>>>>>>>> BamFile. This method also makes use of mclapply().
>>>>>>>>>>> 
>>>>>>>>>>> Valerie
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks in advance for your help and suggestions,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thomas
>>>>>>>>>>>> 
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>>> 
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>>> 
>>> 
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>> 
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}