[BioC] summarizeOverlaps, inter.feature=FALSE and mode="IntersectionNotEmpty"

Wed May 15 22:13:14 CEST 2013

One example where 'IntersectionNotEmpty' results in different counts is 
when a read spans both features even after the shared regions have been 
removed.

rd <- GAlignments("chr2", 7000L, "750M", strand("+"))
ft <- GRanges("chr2", IRanges(c(7000, 7500), width=c(600, 300)), "+")

When TRUE, the read can't be resolved.
se1 <- summarizeOverlaps(ft, rd, mode, inter.feature=TRUE)
 > assays(se1)$counts
      [,1]
[1,]    0
[2,]    0

When FALSE, each feature gets a hit.
se2 <- summarizeOverlaps(ft, rd, mode, inter.feature=FALSE)
 > assays(se2)$counts
      [,1]
[1,]    1
[2,]    1

Updates are implemented in GenomicRanges 1.13.12.

Thanks.
Valerie

On 05/15/2013 10:04 AM, Thomas Girke wrote:
> I agree, given the definition of "IntersectionNotEmpty", the count for
> that read in row 6 should be 1 for Feature1 and 0 for Feature2, whereas
> in row 7 it should be 0 for both. This also means, in this specific case
> there will be no difference in the read assignment (count results) when
> inter.feature is set to FALSE or TRUE. I guess, maining consistent
> feature defintions in the different counting modes will be certainly
> important. Thanks Michael for pointing this out.
>
> Thomas
>
> On Wed, May 15, 2013 at 03:47:46PM +0000, Valerie Obenchain wrote:
>> I agree Mike. Following this same logic, row 7 for IntersectionNotEmpty
>> should be 0,0 instead of 1,1.
>>
>> Thomas, Martin, would you agree?
>>
>> Valerie
>>
>>
>> On 05/15/2013 04:47 AM, Michael Love wrote:
>>> hi,
>>>
>>> The new arguments 'inter.feature' and 'fragments' of summarizeOverlaps look
>>> very useful.
>>>
>>> I'm wondering about the behavior for inter.feature=FALSE and
>>> mode="IntersectionNotEmpty".
>>>
>>> Earlier on the list, from Martin Morgan, there was the proposed behavior:
>>>
>>> You'd like to add 3 more modes, with inter_feature=FALSE. Reads are sometimes
>>> 'counted twice'
>>>
>>> |*----------------------+-----+-----------+-----------+---------------|*|*
>>> Mode                 | Row | Feature 1 | Feature 2 | Hits per read
>>> |*|*----------------------+-----+-----------+-----------+---------------|*|*
>>> Union                |   5 |         1 |         0 |             1
>>> |*|*                      |   6 |         1 |         1 |
>>> 2 |*|*                      |   7 |         1 |         1 |
>>>    2 |*|* IntersectionStrict   |   5 |         1 |         0 |
>>>     1 |*|*                      |   6 |         1 |         0 |
>>>      1 |*|*                      |   7 |         1 |         1 |
>>>       2 |*|* IntersectionNotEmpty |   5 |         1 |         0 |
>>>        1 |*|*                      |   6 |         1 |         1 |
>>>         2 |*|*                      |   7 |         1 |         1 |
>>>          2 |*|*----------------------+-----+-----------+-----------+---------------|*
>>>
>>> ( https://stat.ethz.ch/pipermail/bioconductor/2013-April/052064.html )
>>>
>>> For row 6 of this diagram (
>>> http://bioconductor.org/packages/2.13/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps-modes.pdf)
>>> and 'IntersectionNotEmpty', I don't see why Feature 2 gets a hit.
>>>
>>> The man page has:
>>>
>>> IntersectionNotEmpty : For this counting mode, the features are partitioned
>>> into unique disjoint regions. This is accomplished by disjoining the
>>> feature ranges then removing ranges shared by more than one feature. The
>>> result is a group of non-overlapping regions each of which belong to a
>>> single feature.
>>> Simple and gapped reads are counted if,
>>> 1. the read or exactly 1 of the read fragments overlaps a unique disjoint
>>> region
>>> 2. the read or >1 read fragments overlap >1 unique disjoint region from
>>> the same feature
>>>
>>> An example, where the read falls completely within Feature 1 but partially
>>> in Feature 2, the behavior is consistent with the table above:
>>>
>>>> reads <-
>>> GAlignments(seqnames="chr1",pos=1400L,cigar="500M",strand=strand("+"))
>>>> features <- GRanges("chr1",IRanges(c(1300L,1700L),c(2000L,2400L)))
>>>> assay(summarizeOverlaps(features,reads,mode="IntersectionNotEmpty",
>>> inter.feature=FALSE))
>>>        [,1]
>>> [1,]    1
>>> [2,]    1
>>>
>>>> disjoin(features)
>>> GRanges with 3 ranges and 0 metadata columns:
>>>         seqnames       ranges strand
>>>            <Rle>    <IRanges>  <Rle>
>>>     [1]     chr1 [1300, 1699]      *
>>>     [2]     chr1 [1700, 2000]      *
>>>     [3]     chr1 [2001, 2400]      *
>>>     ---
>>>     seqlengths:
>>>      chr1
>>>        NA
>>>
>>>> findOverlaps(reads, disjoin(features))
>>> Hits of length 2
>>> queryLength: 1
>>> subjectLength: 3
>>>     queryHits subjectHits
>>>      <integer>   <integer>
>>>    1         1           1
>>>    2         1           2
>>>
>>> So the example read overlaps the unique region of Feature 1 and the shared
>>> region, which supposedly has been removed. So I would expect it to only
>>> count to Feature 1.
>>>
>>>> sessionInfo()
>>> R Under development (unstable) (2013-05-14 r62742)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] GenomicRanges_1.13.11 XVector_0.0.0         IRanges_1.19.7
>>> [4] BiocGenerics_0.7.2    BiocInstaller_1.11.1  Defaults_1.1-1
>>>
>>> loaded via a namespace (and not attached):
>>> [1] stats4_3.1.0
>>>
>>>
>>> thanks,
>>>
>>> Mike
>>>
>>> 	[[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor