[BioC] summarizeOverlaps question

Thu Aug 30 23:09:17 CEST 2012

As another option, summarizeOverlaps() can take any counting function as 
the 'mode' and put the results in a SummarizedExperiment.  If the 
built-in counting modes are too conservative or if you have your own 
method of counting you can create your own function and pass it to 
'mode'. The restrictions are

(1)  your function must have the same signature as existing modes which is

function(reads, features, ignore.strand = FALSE, ...)

and

(2) it should return a vector of counts the same length as 'features'.

Here is an example using countOverlaps(),

countOverlapsSO <-  function(reads, features, ignore.strand = FALSE, ...)
{
     countOverlaps(features, reads, ignore.strand=ignore.strand)
}

library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
fl <- system.file("extdata", "untreated1_chr4.bam", 
package="pasillaBamSubset")
bfl <- BamFileList(fl)

 > summarizeOverlaps(exbygene, bfl, mode=countOverlapsSO)
class: SummarizedExperiment
dim: 14869 1
exptData(0):
assays(1): counts
rownames(14869): FBgn0000003 FBgn0000008 ... FBgn0261574 FBgn0261575
rowData metadata column names(0):
colnames(1):

/home/vobencha/R/R-dev/R-2-15-branch/library/pasillaBamSubset/extdata/untreated1_chr4.bam
colData names(1): fileName

Valerie

 > sessionInfo()
R version 2.15.1 Patched (2012-07-19 r59904)
Platform: x86_64-unknown-linux-gnu (64-bit)
...

other attached packages:
[1] TxDb.Dmelanogaster.UCSC.dm3.ensGene_2.7.1
[2] GenomicFeatures_1.9.36
[3] AnnotationDbi_1.19.30
[4] Biobase_2.17.6
[5] Rsamtools_1.9.28
[6] Biostrings_2.25.12
[7] GenomicRanges_1.9.57
[8] IRanges_1.15.42
[9] BiocGenerics_0.3.1

loaded via a namespace (and not attached):
  [1] biomaRt_2.13.2      bitops_1.0-4.1      BSgenome_1.25.6
  [4] DBI_0.2-5           RCurl_1.91-1        RSQLite_0.11.1
  [7] rtracklayer_1.17.19 stats4_2.15.1       tools_2.15.1
[10] XML_3.9-4           zlibbioc_1.3.0
On 08/30/2012 08:20 AM, James W. MacDonald wrote:
> Hi Valerie,
>
> On 8/29/2012 5:37 PM, Valerie Obenchain wrote:
>> Hi Jim,
>>
>> I think what you're seeing is due to how the annotation is provided. 
>> In the first example there are many transcripts the reads can hit and 
>> in the second, there is only one. When a GRangesList is given as 
>> 'subject' each outer list element is considered the feature. So in 
>> the first example we have 55419 features,
>>
>> > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>> > ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE)
>> > class(ex)
>> [1] "GRangesList"
>> attr(,"package")
>> [1] "GenomicRanges"
>> > length(ex)
>> [1] 55419
>>
>> In the second we have one,
>> > ex2 <- ex["uc007amp.2"]
>> > class(ex2)
>> [1] "GRangesList"
>> attr(,"package")
>> [1] "GenomicRanges"
>> > length(ex2)
>> [1] 1
>>
>> summarizeOverlaps() will either discard reads that hit >1 feature 
>> ('Union') or it will make a decision about how to assign the read to 
>> a single feature if possible ('IntersectionStrict' or 
>> 'IntersectionNotEmpty'). I believe in the first case many of your 
>> reads hit > 1 transcript and cannot be resolved using 
>> 'IntersectionNotEmpty'. When a read can't be resolved it is dropped. 
>> In the second case no reads should be dropped due to multi-hits 
>> because there is only 1 possible feature to hit.
>>
>> Let me know if this does not clear things up and maybe I can take a 
>> closer look at your data.
>
> Ah. An unexpected result. I didn't think this through as well as I 
> should have. The HTSeq diagram in the GenomicRanges package (as well 
> as the original at 
> http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) imply 
> overlapping exons from different genes, so I didn't think about the 
> repercussions of overlapping exons from the *same* gene in different 
> transcripts.
>
> Thanks,
>
> Jim
>
>
>
>>
>> Valerie
>>
>>
>> On 08/29/2012 09:39 AM, James W. MacDonald wrote:
>>> I am getting results from summarizeOverlaps(<GrangesList>, 
>>> <BamFileList>) that I don't understand. Here is the way I normally 
>>> use the function:
>>>
>>> library(Rsamtools)
>>> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>>> ex <- exonsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, use.names = TRUE)
>>> samps_casava <- BamFileList(<path to bamfiles>)
>>> olaps <- summarizeOverlaps(ex, samps_casava, mode = 
>>> "IntersectionNotEmpty")
>>>
>>> and then as an example, check the counts for uc007amp.2
>>>
>>> > assays(olaps)$counts["uc007amp.2",, drop=F]
>>>            GG1_ATCACG_L002.uniques.sorted.bam
>>> uc007amp.2                                  1
>>>            GG2_TTAGGC_L002.uniques.sorted.bam
>>> uc007amp.2                                  1
>>>            GG3_ACTTGA_L002.uniques.sorted.bam
>>> uc007amp.2                                  0
>>>            GG4_GATCAG_L002.uniques.sorted.bam
>>> uc007amp.2                                  0
>>>            GG5_TAGCTT_L003.uniques.sorted.bam
>>> uc007amp.2                                  0
>>>            GG6_GGCTAC_L003.uniques.sorted.bam
>>> uc007amp.2                                  1
>>>            GG7_GTGGCC_L003.uniques.sorted.bam
>>> uc007amp.2                                  0
>>>            GG8_GTTTCG_L003.uniques.sorted.bam
>>> uc007amp.2                                  0
>>>            GG9_CGTACG_L004.uniques.sorted.bam
>>> uc007amp.2                                  0
>>>            GG10_GAGTGG_L004.uniques.sorted.bam
>>> uc007amp.2                                   0
>>>            GG11_ACTGAT_L004.uniques.sorted.bam
>>> uc007amp.2                                   1
>>>            GG12_ATTCCT_L004.uniques.sorted.bam
>>> uc007amp.2                                   3
>>>
>>> However, this doesn't make any sense to me, as there are actually 
>>> tons of reads that overlap this transcript (I know this from looking 
>>> at the transcript and these data using IGV). So let's just summarize 
>>> the overlaps for that transcript alone.
>>>
>>> > ex2 <- ex["uc007amp.2"]
>>> > olaps2 <- summarizeOverlaps(ex2, samps_casava, mode = 
>>> "IntersectionNotEmpty")
>>> > assays(olaps2)$counts
>>>            GG1_ATCACG_L002.uniques.sorted.bam
>>> uc007amp.2                                 45
>>>            GG2_TTAGGC_L002.uniques.sorted.bam
>>> uc007amp.2                                 90
>>>            GG3_ACTTGA_L002.uniques.sorted.bam
>>> uc007amp.2                                 70
>>>            GG4_GATCAG_L002.uniques.sorted.bam
>>> uc007amp.2                                 64
>>>            GG5_TAGCTT_L003.uniques.sorted.bam
>>> uc007amp.2                                111
>>>            GG6_GGCTAC_L003.uniques.sorted.bam
>>> uc007amp.2                                 26
>>>            GG7_GTGGCC_L003.uniques.sorted.bam
>>> uc007amp.2                                 34
>>>            GG8_GTTTCG_L003.uniques.sorted.bam
>>> uc007amp.2                                 36
>>>            GG9_CGTACG_L004.uniques.sorted.bam
>>> uc007amp.2                                 98
>>>            GG10_GAGTGG_L004.uniques.sorted.bam
>>> uc007amp.2                                 153
>>>            GG11_ACTGAT_L004.uniques.sorted.bam
>>> uc007amp.2                                 164
>>>            GG12_ATTCCT_L004.uniques.sorted.bam
>>> uc007amp.2                                 174
>>>
>>>
>>> Any ideas where I am going wrong?
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>
>