[BioC] summarizeOverlaps, inter.feature=FALSE and mode="IntersectionNotEmpty"
Valerie Obenchain
vobencha at fhcrc.org
Wed May 15 22:13:14 CEST 2013
One example where 'IntersectionNotEmpty' results in different counts is
when a read spans both features even after the shared regions have been
removed.
rd <- GAlignments("chr2", 7000L, "750M", strand("+"))
ft <- GRanges("chr2", IRanges(c(7000, 7500), width=c(600, 300)), "+")
When TRUE, the read can't be resolved.
se1 <- summarizeOverlaps(ft, rd, mode, inter.feature=TRUE)
> assays(se1)$counts
[,1]
[1,] 0
[2,] 0
When FALSE, each feature gets a hit.
se2 <- summarizeOverlaps(ft, rd, mode, inter.feature=FALSE)
> assays(se2)$counts
[,1]
[1,] 1
[2,] 1
Updates are implemented in GenomicRanges 1.13.12.
Thanks.
Valerie
On 05/15/2013 10:04 AM, Thomas Girke wrote:
> I agree, given the definition of "IntersectionNotEmpty", the count for
> that read in row 6 should be 1 for Feature1 and 0 for Feature2, whereas
> in row 7 it should be 0 for both. This also means, in this specific case
> there will be no difference in the read assignment (count results) when
> inter.feature is set to FALSE or TRUE. I guess, maining consistent
> feature defintions in the different counting modes will be certainly
> important. Thanks Michael for pointing this out.
>
> Thomas
>
> On Wed, May 15, 2013 at 03:47:46PM +0000, Valerie Obenchain wrote:
>> I agree Mike. Following this same logic, row 7 for IntersectionNotEmpty
>> should be 0,0 instead of 1,1.
>>
>> Thomas, Martin, would you agree?
>>
>> Valerie
>>
>>
>> On 05/15/2013 04:47 AM, Michael Love wrote:
>>> hi,
>>>
>>> The new arguments 'inter.feature' and 'fragments' of summarizeOverlaps look
>>> very useful.
>>>
>>> I'm wondering about the behavior for inter.feature=FALSE and
>>> mode="IntersectionNotEmpty".
>>>
>>> Earlier on the list, from Martin Morgan, there was the proposed behavior:
>>>
>>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes
>>> 'counted twice'
>>>
>>> |*----------------------+-----+-----------+-----------+---------------|*|*
>>> Mode | Row | Feature 1 | Feature 2 | Hits per read
>>> |*|*----------------------+-----+-----------+-----------+---------------|*|*
>>> Union | 5 | 1 | 0 | 1
>>> |*|* | 6 | 1 | 1 |
>>> 2 |*|* | 7 | 1 | 1 |
>>> 2 |*|* IntersectionStrict | 5 | 1 | 0 |
>>> 1 |*|* | 6 | 1 | 0 |
>>> 1 |*|* | 7 | 1 | 1 |
>>> 2 |*|* IntersectionNotEmpty | 5 | 1 | 0 |
>>> 1 |*|* | 6 | 1 | 1 |
>>> 2 |*|* | 7 | 1 | 1 |
>>> 2 |*|*----------------------+-----+-----------+-----------+---------------|*
>>>
>>> ( https://stat.ethz.ch/pipermail/bioconductor/2013-April/052064.html )
>>>
>>> For row 6 of this diagram (
>>> http://bioconductor.org/packages/2.13/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps-modes.pdf)
>>> and 'IntersectionNotEmpty', I don't see why Feature 2 gets a hit.
>>>
>>> The man page has:
>>>
>>> IntersectionNotEmpty : For this counting mode, the features are partitioned
>>> into unique disjoint regions. This is accomplished by disjoining the
>>> feature ranges then removing ranges shared by more than one feature. The
>>> result is a group of non-overlapping regions each of which belong to a
>>> single feature.
>>> Simple and gapped reads are counted if,
>>> 1. the read or exactly 1 of the read fragments overlaps a unique disjoint
>>> region
>>> 2. the read or >1 read fragments overlap >1 unique disjoint region from
>>> the same feature
>>>
>>> An example, where the read falls completely within Feature 1 but partially
>>> in Feature 2, the behavior is consistent with the table above:
>>>
>>>> reads <-
>>> GAlignments(seqnames="chr1",pos=1400L,cigar="500M",strand=strand("+"))
>>>> features <- GRanges("chr1",IRanges(c(1300L,1700L),c(2000L,2400L)))
>>>> assay(summarizeOverlaps(features,reads,mode="IntersectionNotEmpty",
>>> inter.feature=FALSE))
>>> [,1]
>>> [1,] 1
>>> [2,] 1
>>>
>>>> disjoin(features)
>>> GRanges with 3 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr1 [1300, 1699] *
>>> [2] chr1 [1700, 2000] *
>>> [3] chr1 [2001, 2400] *
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>>> findOverlaps(reads, disjoin(features))
>>> Hits of length 2
>>> queryLength: 1
>>> subjectLength: 3
>>> queryHits subjectHits
>>> <integer> <integer>
>>> 1 1 1
>>> 2 1 2
>>>
>>> So the example read overlaps the unique region of Feature 1 and the shared
>>> region, which supposedly has been removed. So I would expect it to only
>>> count to Feature 1.
>>>
>>>> sessionInfo()
>>> R Under development (unstable) (2013-05-14 r62742)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.13.11 XVector_0.0.0 IRanges_1.19.7
>>> [4] BiocGenerics_0.7.2 BiocInstaller_1.11.1 Defaults_1.1-1
>>>
>>> loaded via a namespace (and not attached):
>>> [1] stats4_3.1.0
>>>
>>>
>>> thanks,
>>>
>>> Mike
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list