[BioC] summarizeOverlaps vs countOverlaps
Valerie Obenchain
vobencha at fhcrc.org
Wed Feb 29 21:07:35 CET 2012
Hello,
You should expect different results from countOverlaps and
summarizeOverlaps because they count in different ways.
The different modes of summarizeOverlaps counting are explained in the
vignette, "Overview of summarizeOverlaps", found by
browseVignettes("GenomicRanges") or at
http://bioconductor.org/packages/2.10/bioc/html/GenomicRanges.html
summarizeOverlaps was developed to provide count methods that do not
"double count" a read that hit more than one feature (where a feature is
an exon, transcript etc.). summarizeOverlaps also treats a GRanges and
GRangesList differently. This is explained in the man page and in the
vignette. Once you've had a chance to read the documentation let me know
if this is not clear and I'd be happy to explain with examples.
Note that the GappedAlignments container is gap-aware. When a
GappedAlignment is coerced to a GRangesList the portions of the gapped
read become distinct rows in the new GRanges obejct. When a
GappedAlignment is coerced to a GRanges the gaps are collapsed and the
range is continuous. In other words we are no longer aware of our gaps.
## Here we create the three objects with the same reads,
galn <- GappedAlignments(
names = c("a","b","c"),
rname = Rle(rep("chr1", 3)),
pos = as.integer(c(1400, 3100, 5200)),
cigar = c("500M", "50M200N50M", "50M150N50M"),
strand = strand(rep("+", 3)))
gr <- as(galn, "GRanges")
grl <- as(galn, "GRangesList")
## Create a subject for overlap testing that falls in the gap of the
second range
> subj <- GRanges("chr1", IRanges(3200, 3300))
> subj
GRanges with 1 range and 0 elementMetadata cols:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [3200, 3300] *
---
seqlengths:
chr1
NA
## Investigate how this range is counted
> countOverlaps(galn, subj)
a b c
0 0 0
> countOverlaps(gr, subj)
a b c
0 1 0
> countOverlaps(grl, subj)
a b c
0 0 0
Valerie
On 02/29/12 10:24, swaraj basu wrote:
> Dear All,
>
> I am getting completely different results for the count of mapped reads on
> transcript features using the countOverlaps and summarizeOverlaps.
>
> CountOverlaps on using a gappedAlignment and Granges object gives slightly
> different results. SummarizeOverlaps gives totally different results for
> the modes "Union" and "intersectionNotEmpty".
>
> Can someone please let me know where I am making a mistake. I have attached
> a screenshot of the transcripts. All the transcripts belong to the same
> gene.
>
> Following steps were performed by me on a small subset of my data
>
> 1. I built a gapped alignment object of mapped reads using the command
> reads=readBamGappedAlignments("path / to / bam/sorted.bam")
>
> 2. From the gapped alignment I built a genomic ranges object
> ranges=GRanges(seqnames=rname(reads),
> ranges=IRanges(start=start(reads),
> end=end(reads)),strand=rep("*",length(reads)))
>
> 3. I built a genomic ranges list from a custom genedb object (4
> transcripts).
> geneE=exonsBy(genedb,'gene')
>
> 4. Used the count and summarize overlaps functions
> a). count1=countOverlaps(geneE, reads)
> b). count2=countOverlaps(geneE, ranges)
> c). count3=assays(summarizeOverlaps(geneE, reads,
> mode="UNION"))$counts[,1]
> c). count4=assays(summarizeOverlaps(geneE, reads,
> mode="intersectionNotEmpty"))$counts[,1]
>
> 5. I get four different count results
> "T1, T2, T3, T4"
> a). count1 "55972, 41029, 18270, 18734"
> b). count2 "55989, 41046, 18287, 18751"
> c). count3 "0, 2, 0, 1092"
> d). count4 "0, 3712, 0, 2550"
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list