[BioC] summarizeOverlaps vs countOverlaps

Wed Feb 29 21:07:35 CET 2012

Hello,

You should expect different results from countOverlaps and 
summarizeOverlaps because they count in different ways.

The different modes of summarizeOverlaps counting are explained in the 
vignette, "Overview of summarizeOverlaps", found by

browseVignettes("GenomicRanges") or at

http://bioconductor.org/packages/2.10/bioc/html/GenomicRanges.html

summarizeOverlaps was developed to provide count methods that do not 
"double count" a read that hit more than one feature (where a feature is 
an exon, transcript etc.). summarizeOverlaps also treats a GRanges and 
GRangesList differently. This is explained in the man page and in the 
vignette. Once you've had a chance to read the documentation let me know 
if this is not clear and I'd be happy to explain with examples.

Note that the GappedAlignments container is gap-aware. When a 
GappedAlignment is coerced to a GRangesList the portions of the gapped 
read become distinct rows in the new GRanges obejct. When a 
GappedAlignment is coerced to a GRanges the gaps are collapsed and the 
range is continuous. In other words we are no longer aware of our gaps.

## Here we create the three objects with the same reads,

galn <- GappedAlignments(
            names = c("a","b","c"),
            rname =  Rle(rep("chr1", 3)),
            pos = as.integer(c(1400, 3100, 5200)),
            cigar = c("500M",  "50M200N50M", "50M150N50M"),
            strand = strand(rep("+", 3)))
gr <- as(galn, "GRanges")
grl <- as(galn, "GRangesList")

## Create a subject for overlap testing that falls in the gap of the 
second range

 > subj <- GRanges("chr1", IRanges(3200, 3300))
 > subj
GRanges with 1 range and 0 elementMetadata cols:
       seqnames       ranges strand
<Rle> <IRanges> <Rle>
   [1]     chr1 [3200, 3300]      *
   ---
   seqlengths:
    chr1
      NA

## Investigate how this range is counted

 > countOverlaps(galn, subj)
a b c
0 0 0
 > countOverlaps(gr, subj)
a b c
0 1 0
 > countOverlaps(grl, subj)
a b c
0 0 0

Valerie

On 02/29/12 10:24, swaraj basu wrote:
> Dear All,
>
> I am getting completely different results for the count of mapped reads on
> transcript features using the countOverlaps and summarizeOverlaps.
>
> CountOverlaps on using a gappedAlignment and Granges object gives slightly
> different results. SummarizeOverlaps gives totally different results for
> the modes "Union" and "intersectionNotEmpty".
>
> Can someone please let me know where I am making a mistake. I have attached
> a screenshot of the transcripts. All the transcripts belong to the same
> gene.
>
> Following steps were performed by me on a small subset of my data
>
> 1. I built a gapped alignment object of mapped reads using the command
> reads=readBamGappedAlignments("path / to / bam/sorted.bam")
>
> 2. From the gapped alignment I built a genomic ranges object
> ranges=GRanges(seqnames=rname(reads),
> ranges=IRanges(start=start(reads),
> end=end(reads)),strand=rep("*",length(reads)))
>
> 3. I built a genomic ranges list from a custom genedb object (4
> transcripts).
> geneE=exonsBy(genedb,'gene')
>
> 4. Used the count and summarize overlaps functions
> a). count1=countOverlaps(geneE, reads)
> b). count2=countOverlaps(geneE, ranges)
> c). count3=assays(summarizeOverlaps(geneE, reads,
> mode="UNION"))$counts[,1]
> c). count4=assays(summarizeOverlaps(geneE, reads,
> mode="intersectionNotEmpty"))$counts[,1]
>
> 5. I get four different count results
> "T1, T2, T3, T4"
> a). count1 "55972, 41029, 18270, 18734"
> b). count2 "55989, 41046, 18287, 18751"
> c). count3 "0, 2, 0, 1092"
> d). count4 "0, 3712, 0, 2550"
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor