[BioC] reducing hits from countGenomicOverlaps()

Thu Oct 27 21:44:45 CEST 2011

Hi Valerie,

On 10/27/11 7:53 PM, Valerie Obenchain wrote:
> Hi Robert,
>
>
> On 10/26/2011 03:34 PM, Robert Castelo wrote:
>> hi,
>>
>> On 10/27/11 12:07 AM, Martin Morgan wrote:
>>> On 10/25/2011 05:35 PM, Robert Castelo wrote:
>>>> dear list,
>>>>
>>>> the following four lines allow one to count overlaps of aligned
>>>> short-reads with annotations:
>>>>
>>>> aln <- readGappedAlignments("somebamfile.bam")
>>>> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="ensGene")
>>>> ensGenes <- exonsBy(txdb, by="gene")
>>>> ov <- countGenomicOverlaps(aln, ensGenes)
>>>>
>>>> then i want to get read-counts per gene and the first thing that comes
>>>> to my head is doing:
>>>>
>>>> counts <- sapply(ov, function(x) sum(values(x)[["hits"]]))
>>>>
>>>> which goes through every gene and adds up the "hits" of its exons.
>>>> however, this latter step of "just adding" takes longer than the actual
>>>> calculation of the hits with countGenomicOverlaps() and i guess that
>>>> there are more efficient ways to approach this, probably something
>>>> around "reducing the hits value column". i've been looking at rdapply()
>>>> and reduce() and googled too, but couldn't find anything, so i look
>>>> forward to your suggestions.
>>>
>>> Hi Robert -- A strategy here is along the lines (untested!) of
>>>
>>> hits <- values(unlist(ov))[["hits"]]
>>> genes <- rep(names(ov), elementLengths(ov))
>>> counts <- sapply(split(hits, genes), sum)
>>
>> beautiful and fast, i've just tested it with the ~ 50,000 Ensembl
>> genes and the execution time is nearly negligible, less than half a
>> second on my laptop.
>>
>>> but you'll want to make sure that this is a conceptually sensible way of
>>> counting hits per gene.
>>
>> i'm aiming at counting exonic-reads and adding them up at gene level.
>> the function countGenomicOverlaps() allows me to tune the part of the
>> logic related to counting exonic reads, but is there any function that
>> would allow me to tune the summing of exonic reads?
>
> How do you want to tune the summing? If you want to sum all exons by
> gene the piece of code Martin provided should do it. Was there another
> goal you had in mind?

well, Martin was warning me about whether just adding up exon counts per
gene makes complete sense, and i guess the problem might arise with
overlapping exons of different transcripts, since by just summing i'd be
adding some counts twice. so i was wondering whether something exists to
control this summing, along the lines of the decision-logic options that
countGenomicOverlaps() offers for counting reads overlapping annotations.
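
to make the concern concrete, here is a minimal sketch with made-up
coordinates (the toy gene, its two exons and the single read are
assumptions for illustration, not data from this thread). it uses plain
countOverlaps() rather than countGenomicOverlaps(), whose resolution
options may distribute hits differently, and it collapses overlapping
exons with reduce() as one possible way around the double count:

```r
library(GenomicRanges)

# toy gene with two overlapping exons, as if from two transcripts
# (hypothetical coordinates)
exons <- GRanges("chr1",
                 IRanges(start=c(100, 150), end=c(200, 250)),
                 strand="+")
genes <- GRangesList(geneA=exons)

# one toy read lying in the region shared by both exons
read <- GRanges("chr1", IRanges(start=160, end=190), strand="+")

# per-exon hits summed per gene: the read is counted once per exon
perExon <- countOverlaps(unlist(genes), read)
naiveCount <- sum(perExon)

# merging overlapping exons within each gene first, then counting
collapsed <- reduce(genes)
collapsedCount <- countOverlaps(collapsed, read)

naiveCount       # 2
collapsedCount   # 1 (for geneA)
```

whether collapsing exons like this is the right summary depends on what
one wants the gene-level count to mean, so this is only one option.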

cheers,
robert.