[BioC] summarizeOverlaps question
James W. MacDonald
jmacdon at uw.edu
Thu Aug 30 17:20:55 CEST 2012
Hi Valerie,
On 8/29/2012 5:37 PM, Valerie Obenchain wrote:
> Hi Jim,
>
> I think what you're seeing is due to how the annotation is provided.
> In the first example there are many transcripts the reads can hit and
> in the second, there is only one. When a GRangesList is given as
> 'subject' each outer list element is considered the feature. So in the
> first example we have 55419 features,
>
> > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
> > ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE)
> > class(ex)
> [1] "GRangesList"
> attr(,"package")
> [1] "GenomicRanges"
> > length(ex)
> [1] 55419
>
> In the second we have one,
> > ex2 <- ex["uc007amp.2"]
> > class(ex2)
> [1] "GRangesList"
> attr(,"package")
> [1] "GenomicRanges"
> > length(ex2)
> [1] 1
>
> summarizeOverlaps() will either discard reads that hit >1 feature
> ('Union') or it will make a decision about how to assign the read to a
> single feature if possible ('IntersectionStrict' or
> 'IntersectionNotEmpty'). I believe in the first case many of your
> reads hit > 1 transcript and cannot be resolved using
> 'IntersectionNotEmpty'. When a read can't be resolved it is dropped.
> In the second case no reads should be dropped due to multi-hits
> because there is only 1 possible feature to hit.
>
> Let me know if this does not clear things up and maybe I can take a
> closer look at your data.
Ah. An unexpected result. I didn't think this through as well as I
should have. The HTSeq diagram in the GenomicRanges package (as well as
the original at
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) imply
overlapping exons from different genes, so I didn't think about the
repercussions of overlapping exons from the *same* gene in different
transcripts.
Thanks,
Jim
>
> Valerie
>
>
> On 08/29/2012 09:39 AM, James W. MacDonald wrote:
>> I am getting results from summarizeOverlaps(<GrangesList>,
>> <BamFileList>) that I don't understand. Here is the way I normally
>> use the function:
>>
>> library(Rsamtools)
>> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>> ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE)
>> samps_casava <- BamFileList(<path to bamfiles>)
>> olaps <- summarizeOverlaps(ex, samps_casava, mode =
>> "IntersectionNotEmpty")
>>
>> and then as an example, check the counts for uc007amp.2
>>
>> > assays(olaps)$counts["uc007amp.2",, drop=F]
>> GG1_ATCACG_L002.uniques.sorted.bam
>> uc007amp.2 1
>> GG2_TTAGGC_L002.uniques.sorted.bam
>> uc007amp.2 1
>> GG3_ACTTGA_L002.uniques.sorted.bam
>> uc007amp.2 0
>> GG4_GATCAG_L002.uniques.sorted.bam
>> uc007amp.2 0
>> GG5_TAGCTT_L003.uniques.sorted.bam
>> uc007amp.2 0
>> GG6_GGCTAC_L003.uniques.sorted.bam
>> uc007amp.2 1
>> GG7_GTGGCC_L003.uniques.sorted.bam
>> uc007amp.2 0
>> GG8_GTTTCG_L003.uniques.sorted.bam
>> uc007amp.2 0
>> GG9_CGTACG_L004.uniques.sorted.bam
>> uc007amp.2 0
>> GG10_GAGTGG_L004.uniques.sorted.bam
>> uc007amp.2 0
>> GG11_ACTGAT_L004.uniques.sorted.bam
>> uc007amp.2 1
>> GG12_ATTCCT_L004.uniques.sorted.bam
>> uc007amp.2 3
>>
>> However, this doesn't make any sense to me, as there are actually
>> tons of reads that overlap this transcript (I know this from looking
>> at the transcript and these data using IGV). So let's just summarize
>> the overlaps for that transcript alone.
>>
>> > ex2 <- ex["uc007amp.2"]
>> > olaps2 <- summarizeOverlaps(ex2, samps_casava, mode =
>> "IntersectionNotEmpty")
>> > assays(olaps2)$counts
>> GG1_ATCACG_L002.uniques.sorted.bam
>> uc007amp.2 45
>> GG2_TTAGGC_L002.uniques.sorted.bam
>> uc007amp.2 90
>> GG3_ACTTGA_L002.uniques.sorted.bam
>> uc007amp.2 70
>> GG4_GATCAG_L002.uniques.sorted.bam
>> uc007amp.2 64
>> GG5_TAGCTT_L003.uniques.sorted.bam
>> uc007amp.2 111
>> GG6_GGCTAC_L003.uniques.sorted.bam
>> uc007amp.2 26
>> GG7_GTGGCC_L003.uniques.sorted.bam
>> uc007amp.2 34
>> GG8_GTTTCG_L003.uniques.sorted.bam
>> uc007amp.2 36
>> GG9_CGTACG_L004.uniques.sorted.bam
>> uc007amp.2 98
>> GG10_GAGTGG_L004.uniques.sorted.bam
>> uc007amp.2 153
>> GG11_ACTGAT_L004.uniques.sorted.bam
>> uc007amp.2 164
>> GG12_ATTCCT_L004.uniques.sorted.bam
>> uc007amp.2 174
>>
>>
>> Any ideas where I am going wrong?
>>
>> Best,
>>
>> Jim
>>
>>
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list