[BioC] summarizeOverlaps mode ignoring inter feature overlaps
Wei Shi
shi at wehi.EDU.AU
Fri Apr 12 04:11:51 CEST 2013
Hi Michael,
I could not find the 'summarizeSpliceOverlaps' function in the GenomicFeatures package (1.12.0).
The findSpliceOverlaps function seems to count reads more than once as well. I ran the sample code in the help page for this function and found that the single fragment was assigned to both transcripts:
> genes <- GRangesList(
+ GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"),
+ GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+"))
> genes
GRangesList of length 2:
[[1]]
GRanges with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 5, 10] +
[2] chr1 [20, 25] +
[[2]]
GRanges with 2 ranges and 0 metadata columns:
seqnames ranges strand
[1] chr1 [ 5, 15] +
[2] chr1 [22, 25] +
---
seqlengths:
chr1
NA
> galp <- GappedAlignmentPairs(
+ GappedAlignments("chr1", 5L, "11M4N6M", strand("+")),
+ GappedAlignments("chr1", 50L, "6M", strand("-")),
+ isProperPair=TRUE)
> galp
GappedAlignmentPairs with 1 alignment pair and 0 metadata columns:
seqnames strand : ranges -- ranges
<Rle> <Rle> : <IRanges> -- <IRanges>
[1] chr1 + : [5, 25] -- [50, 55]
---
seqlengths:
chr1
NA
> findSpliceOverlaps(galp, genes)
Hits of length 2
queryLength: 1
subjectLength: 2
queryHits subjectHits compatible unique coding strandSpecific
<integer> <integer> <logical> <logical> <logical> <logical>
1 1 1 FALSE FALSE NA TRUE
2 1 2 FALSE FALSE NA TRUE
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] limma_3.16.1 GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0 GenomicRanges_1.12.1
[6] IRanges_1.18.0 BiocGenerics_0.6.0 Rsubread_1.10.1 BiocInstaller_1.10.0
loaded via a namespace (and not attached):
[1] biomaRt_2.16.0 Biostrings_2.28.0 bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-5 RCurl_1.95-4.1
[7] Rsamtools_1.12.0 RSQLite_0.11.2 rtracklayer_1.20.0 stats4_3.0.0 tools_3.0.0 XML_3.95-0.2
[13] zlibbioc_1.6.0
>
Wei
On Apr 11, 2013, at 11:36 PM, Michael Lawrence wrote:
> On Wed, Apr 10, 2013 at 7:09 PM, Thomas Girke <thomas.girke at ucr.edu> wrote:
>
>> Hi Martin,
>>
>> Yes, inter_feature=TRUE would maintain the current counting mode(s) that
>> prohibits counting of reads mapping to multiple features. This is a
>> special case of counting that is very useful for counting exonic regions
>> of genes. However, one also wants to be able to turn off this behavior
>> by ignoring inter feature overlaps (just like ignore.strand=T/F).
>> Otherwise we cannot use summarizeOverlaps along with its current modes
>> for important operations like transcript level counting because many
>> transcript variants from the same gene will mask each others reads when
>> inter_feature=TRUE. Providing the option to output the results from both
>> inter_feature=TRUE and inter_feature=FALSE could be a very sensible
>> solution and time saver for users working with new genomes/GFFs, where
>> one cannot trust every nested annotation for various reasons, and
>> inter_feature=TRUE can quickly become a very risky counting strategy
>> since it tends to erase counts. Every biologist will scream in your
>> face if the counter tells them that their favorite gene has zero counts
>> just because of some overlap with some annotation error:).
>>
>> For multiple range features stored in GRangesList objects, I would
>> currently favor making "inter_feature=FALSE" ignore the overlaps
>> occurring among different list components, but not necessarily those
>> within a list component (e.g. exon ranges of a gene). This way one
>> can benefit from the current infrastructure by restricting its feature
>> overlap scope to the range sets stored within individual list components
>> but ignoring those among different list components. Utilities like reduce
>> and other range modifier functions could handle situations where one wants
>> to ignore all feature overlaps within and among list components. However,
>> I am sure there could be other solutions to this.
>>
>> Long story short if inter_feature=TRUE/FALSE could be used in
>> combination with modes=Union/IntersectionStrict/IntersectionNonEmpty
>> resulting in six different counting flavors, I would be happy and, I am
>> sure, many other Bioc users as well.
>>
>>
> Have you taken a look at findSpliceOverlaps? Maybe summarizeSpliceOverlaps
> could be completed to satisfy your use case? Somehow we need to control the
> proliferation of counting functions and modes. Having some idea of your
> high-level use case might help.
>
> Also, for spliced alignments, I recommend giving GSNAP a try if you haven't
> already. It's accessible though the gmapR package.
>
> Michael
>
> I hope this makes sense.
>>
>> Thomas
>>
>>
>>
>> On Wed, Apr 10, 2013 at 11:37:04PM +0000, Martin Morgan wrote:
>>> On 04/10/2013 12:42 PM, Thomas Girke wrote:
>>>> Thanks. Adding an inter-feature unaware mode will be very helpful and
>> also
>>>> broaden summarizeOverlaps' application spectrum for a lot of use cases.
>>>
>>> I'm probably being quite dense here, and am mostly an outside observer.
>> What I
>>> hear you saying is that there are currently three modes -- Union,
>>> IntersectionStrict, IntersectionNonEmpty. These modes are summarized in
>> the
>>> seven rows of figure 1 of
>>>
>>> vignette("summarizeOverlaps", package="GenomicRanges")
>>>
>>> Let's say there is a flag inter_feature, and when its value is TRUE then
>> the
>>> current counting schemes are obtained. These modes differ in the way
>> counting
>>> works for reads illustrated in rows 5, 6, and 7 of the figure. You'd
>> like a
>>> count scored where a '1' appears in the table below. With
>> inter_feature=TRUE
>>> reads are 'never counted more than once' ('Hits per read' is <= 1)
>>>
>>> |----------------------+-----+-----------+-----------+---------------|
>>> | Mode | Row | Feature 1 | Feature 2 | Hits per read |
>>> |----------------------+-----+-----------+-----------+---------------|
>>> | Union | 5 | 1 | 0 | 1 |
>>> | | 6 | 0 | 0 | 0 |
>>> | | 7 | 0 | 0 | 0 |
>>> | IntersectionStrict | 5 | 1 | 0 | 1 |
>>> | | 6 | 1 | 0 | 1 |
>>> | | 7 | 0 | 0 | 0 |
>>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 |
>>> | | 6 | 1 | 0 | 1 |
>>> | | 7 | 0 | 0 | 0 |
>>> |----------------------+-----+-----------+-----------+---------------|
>>>
>>>
>>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are
>> sometimes
>>> 'counted twice'
>>>
>>> |----------------------+-----+-----------+-----------+---------------|
>>> | Mode | Row | Feature 1 | Feature 2 | Hits per read |
>>> |----------------------+-----+-----------+-----------+---------------|
>>> | Union | 5 | 1 | 0 | 1 |
>>> | | 6 | 1 | 1 | 2 |
>>> | | 7 | 1 | 1 | 2 |
>>> | IntersectionStrict | 5 | 1 | 0 | 1 |
>>> | | 6 | 1 | 0 | 1 |
>>> | | 7 | 1 | 1 | 2 |
>>> | IntersectionNotEmpty | 5 | 1 | 0 | 1 |
>>> | | 6 | 1 | 1 | 2 |
>>> | | 7 | 1 | 1 | 2 |
>>> |----------------------+-----+-----------+-----------+---------------|
>>>
>>>
>>> Martin
>>>
>>>>
>>>> Thomas
>>>>
>>>>
>>>> On Wed, Apr 10, 2013 at 04:59:49PM +0000, Valerie Obenchain wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> On 04/10/2013 09:23 AM, Thomas Girke wrote:
>>>>>> Valerie,
>>>>>>
>>>>>> Please see my inserted comments below.
>>>>>>
>>>>>> On Wed, Apr 10, 2013 at 03:50:54PM +0000, Valerie Obenchain wrote:
>>>>>>> Ah, I see. You'd like to count with one of the existing modes but
>> have
>>>>>>> the option to pick up counts for these inter-feature reads (fall
>>>>>>> completely 'within' >1 feature). These inter-feature reads would be
>>>>>>> double (triple, quadruple, etc.) counted. Essentially they would
>> add one
>>>>>>> count to each feature they hit. Right?
>>>>>>
>>>>>> Correct. Perhaps let's discuss this with a very common example of
>>>>>> transcript-level counting rather than counting on the unified
>> (virtual)
>>>>>> exonic regions of genes. With the current description provided in the
>>>>>> summarizeOverlaps vignette at
>>>>>>
>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf
>>>>>> I don't see how this can be achieved without falling back to using
>>>>>> countOverlaps without loosing the new counting modes provided by
>>>>>> summarizeOverlaps?
>>>>>
>>>>> It can't be achieved with the function as is but we could add an
>>>>> argument to handle this (as you suggested from the start). If
>>>>> 'inter-feature=TRUE' then these counts would be added to the counts
>>>>> already obtained from using a particular 'mode'. I will move ahead
>> with
>>>>> implementing this argument.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> One more thought about memory usage. If you are working with
>> single-end
>>>>>>> reads the summarizeOverlaps,BamFileList-method I mentioned below
>> should
>>>>>>> work fine. The approach is slightly different with paired-end
>> reads. Our
>>>>>>> current algorithm for pairing paired-end reads requires the whole
>> file
>>>>>>> to be read into memory. A different approach is currently being
>>>>>>> developed but in the meantime you can take the qname-sorted
>> approach.
>>>>>>> The Bam file will need to be sorted by qname and both the
>> 'yieldSize'
>>>>>>> and 'obeyQname=TRUE' set when creating the BamFile/BamFileList. An
>>>>>>> example is on ?BamFile.
>>>>>>
>>>>>> Thanks for pointing this out. My fault that I didn't read through all
>>>>>> the linked documentation. Perhaps it may not be a bad idea to make
>> the
>>>>>> memory restricted bam read instances the default setting in the
>> future.
>>>>>> This will definitely help biologists using those utilities without
>>>>>> crashing their machines on the first attempt.
>>>>>
>>>>> Good suggestion, thanks.
>>>>>
>>>>> Valerie
>>>>>
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Valerie
>>>>>>>
>>>>>>>
>>>>>>> On 04/09/2013 08:01 PM, Thomas Girke wrote:
>>>>>>>> Thanks for the tip. I guess doing it this way reverses the
>> counting mode
>>>>>>>> back to countOverlaps, but how can I use at the same time
>>>>>>>> "IntersectionStrict" or any of the other modes provided by
>>>>>>>> summarizeOverlaps if its mode argument is already used and
>> countOverlaps
>>>>>>>> doesn't accept one?
>>>>>>>>
>>>>>>>> Thomas
>>>>>>>>
>>>>>>>> On Tue, Apr 09, 2013 at 09:08:02PM +0000, Valerie Obenchain wrote:
>>>>>>>>> Thanks for the example. You're right, none of the modes will
>> count a
>>>>>>>>> read falling completely within >1 feature.
>>>>>>>>>
>>>>>>>>> You can supply your own function as a 'mode'. Two requirements
>> must be met:
>>>>>>>>>
>>>>>>>>> (1) The function must have 3 arguments that correspond to
>>>>>>>>> 'features', 'reads' and 'ignore.strand', in that order.
>>>>>>>>>
>>>>>>>>> (2) The function must return a vector of counts the same length
>>>>>>>>> as 'features'
>>>>>>>>>
>>>>>>>>> Here is an example using countOverlaps() which I think gets at the
>>>>>>>>> counting you want.
>>>>>>>>>
>>>>>>>>> counter <- function(x, y, ignore.strand)
>>>>>>>>> countOverlaps(y, x, ignore.strand=ignore.strand)
>>>>>>>>>
>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode=counter))$counts
>>>>>>>>> [,1]
>>>>>>>>> [1,] 1
>>>>>>>>> [2,] 1
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Valerie
>>>>>>>>>
>>>>>>>>> On 04/09/2013 09:37 AM, Thomas Girke wrote:
>>>>>>>>>> Hi Valerie,
>>>>>>>>>>
>>>>>>>>>> Perhaps let's call this more explicitly an
>> ignore_inter_feature_overlap option
>>>>>>>>>> that we often need for cases like this:
>>>>>>>>>>
>>>>>>>>>> Read1 ----
>>>>>>>>>> F1 ----------------
>>>>>>>>>> F2 -----------
>>>>>>>>>>
>>>>>>>>>> where we would like to have at least the option to assign Read1
>> to both F1 and F2:
>>>>>>>>>>
>>>>>>>>>> F1: 1
>>>>>>>>>> F2: 1
>>>>>>>>>>
>>>>>>>>>> However, summarizeOverlapse doesn't count Read1 at all in all of
>> its currently
>>>>>>>>>> available modes that I am aware of. This lack of an
>> ignore_inter_feature_overlap
>>>>>>>>>> option is frequently a problem when working with poorly
>> annotated genomes (high
>>>>>>>>>> uncertainty about hidden/incorrect feature overlaps) or various
>>>>>>>>>> RNA/ChIP-Seq _engineering_ projects where I rather take the risk
>> of ambiguous read
>>>>>>>>>> assignments than not counting at all as shown above.
>>>>>>>>>>
>>>>>>>>>> ## Here is a code example illustrating the same case:
>>>>>>>>>> library(GenomicRanges); library(Rsamtools)
>>>>>>>>>> rd <- GappedAlignments(letters[1], seqnames = Rle(rep("chr1",1)),
>>>>>>>>>> pos = as.integer(c(500)),
>>>>>>>>>> cigar = rep("100M", 1), strand = strand(rep("+", 1)))
>>>>>>>>>> gr1 <- GRanges("chr1", IRanges(start=100, width=1001),
>> strand="+", ID="feat1")
>>>>>>>>>> gr2 <- GRanges("chr1", IRanges(start=500, width=101),
>> strand="+", ID="feat2")
>>>>>>>>>> gr <- c(gr1, gr2)
>>>>>>>>>>
>>>>>>>>>> ## All of the three current modes in summarizeOverlaps return a
>> count of zero
>>>>>>>>>> ## because of its inter_feature_overlap awareness:
>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="Union",
>> ignore.strand=TRUE))$counts
>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionStrict",
>> ignore.strand=TRUE))$counts
>>>>>>>>>> assays(summarizeOverlaps(gr, rd, mode="IntersectionNotEmpty",
>> ignore.strand=TRUE))$counts
>>>>>>>>>> [,1]
>>>>>>>>>> [1,] 0
>>>>>>>>>> [2,] 0
>>>>>>>>>>
>>>>>>>>>> ## However, countOverlaps handles this correctly, but is not the
>> best choice when
>>>>>>>>>> ## counting multiple range features.
>>>>>>>>>> countOverlaps(gr, rd)
>>>>>>>>>> [1] 1 1
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thomas
>>>>>>>>>>
>>>>>>>>>>> sessionInfo()
>>>>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>>>>>
>>>>>>>>>> locale:
>>>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>>>>
>>>>>>>>>> attached base packages:
>>>>>>>>>> [1] parallel stats graphics grDevices utils datasets
>> methods
>>>>>>>>>> [8] base
>>>>>>>>>>
>>>>>>>>>> other attached packages:
>>>>>>>>>> [1] Rsamtools_1.12.0 Biostrings_2.28.0
>> GenomicRanges_1.12.1
>>>>>>>>>> [4] IRanges_1.18.0 BiocGenerics_0.6.0
>>>>>>>>>>
>>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>>> [1] bitops_1.0-5 stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 09, 2013 at 03:42:03PM +0000, Valerie Obenchain
>> wrote:
>>>>>>>>>>> Hi Thomas,
>>>>>>>>>>>
>>>>>>>>>>> On 04/08/2013 05:52 PM, Thomas Girke wrote:
>>>>>>>>>>>> Dear Valerie,
>>>>>>>>>>>>
>>>>>>>>>>>> Is there currently any way to run summarizeOverlaps in a
>> feature-overlap
>>>>>>>>>>>> unaware mode, e.g with an ignorefeatureOL=FALSE/TRUE setting?
>> Currently,
>>>>>>>>>>>> one can switch back to countOverlaps when feature overlap
>> unawareness is
>>>>>>>>>>>> the more appropriate counting mode for a biological question,
>> but then
>>>>>>>>>>>> double counting of reads mapping to multiple-range features is
>> not
>>>>>>>>>>>> accounted for. It would be really nice to have such a
>> feature-overlap
>>>>>>>>>>>> unaware option directly in summarizeOverlaps.
>>>>>>>>>>>
>>>>>>>>>>> No, we don't currently have an option to ignore
>> feature-overlap. It
>>>>>>>>>>> sounds like you want to count with countOverlaps() but still
>> want the
>>>>>>>>>>> double counting to be resolved. This is essentially what the
>> other modes
>>>>>>>>>>> are doing so I must be missing something.
>>>>>>>>>>>
>>>>>>>>>>> In this example 2 reads hit feature A, 1 read hits feature B.
>> With
>>>>>>>>>>> something like ignorefeature0L=TRUE, what results would you
>> expect to
>>>>>>>>>>> see? Maybe you have another, more descriptive example?
>>>>>>>>>>>
>>>>>>>>>>> reads <- GRanges("chr1", IRanges(c(1, 5, 20), width=3))
>>>>>>>>>>> features <- GRanges("chr1", IRanges(c(1, 20), width=10,
>>>>>>>>>>> names=c("A", "B")))
>>>>>>>>>>>
>>>>>>>>>>>> countOverlaps(features, reads)
>>>>>>>>>>> [1] 2 1
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Another question relates to the memory usage of
>> summarizeOverlaps. Has
>>>>>>>>>>>> this been optimized yet? On a typical bam file with ~50-100
>> million
>>>>>>>>>>>> reads the memory usage of summarizeOverlaps is often around
>> 10-20GB. To
>>>>>>>>>>>> use the function on a desktop computer or in large-scale
>> RNA-Seq
>>>>>>>>>>>> projects on a commodity compute cluster, it would be desirable
>> if every
>>>>>>>>>>>> counting instance would consume not more than 5GB of RAM.
>>>>>>>>>>>
>>>>>>>>>>> Have you tried the BamFileList-method? There is an example at
>> the bottom
>>>>>>>>>>> of the ?BamFileList man page using summarizeOverlaps(). As Ryan
>>>>>>>>>>> mentioned, the key is to set the 'yieldSize' parameter when
>> creating the
>>>>>>>>>>> BamFile. This method also makes use of mclapply().
>>>>>>>>>>>
>>>>>>>>>>> Valerie
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance for your help and suggestions,
>>>>>>>>>>>>
>>>>>>>>>>>> Thomas
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioconductor mailing list
>>>>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>>>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}
More information about the Bioconductor
mailing list