[BioC] multiple feature/mode counting with summarizeOverlaps

Thu May 23 23:06:10 CEST 2013

Hi Thomas,

On 05/20/2013 10:12 PM, Thomas Girke wrote:
> I apologize up front to Valarie for posting additional questions related
> to her excellent read counter. This time my questions are related to the
> economics of accessing big files as few times as possible to improve
> workflow performance in large RNA-Seq projects and to subsequently allow
> deleting all input bam files since BIG DATA is killing us :).
>
> (1) What is currently the intended approach to obtain counts for
> multiple feature types in a single pass-through of one or many bam
> files?  For instance, often we want to count the reads overlapping with
> many different features types (not just one), such as exonBygenes,
> cdsBygenes, UTRs, introns and intergenics. Computing all those feature
> counts sequentially in a loop is easy, but can be time consuming and
> heavy on disk I/O (the most expensive resource in NGS analysis) and
> internal networks when running many counting processes in parallel. As a
> partial solution to this, I often organize all feature types in one big
> GRanges/GRangesList container, perform the counting and then split the
> resulting count tables accordingly. However, is this really the best way
> of doing this?  A major limitation of this approach is that it does not
> support feature overlap aware counting modes.

I would not recommend combining different annotations into a single
GRanges/GRangesList. To summarizeOverlaps(), each row of a GRanges or 
each list element of a GRangesList represents a feature. If a read hits 
more than one feature we have a double hit situation that must be 
resolved. If you're interested in counting both exonsByGene and 
transcriptsByGene you probably want these to be independent counts. By 
combining them into one object they can be confounding and create 
multiple hits situations where there really aren't any.

Toy example, but let's say 'gr1' comes from exonsBy() and 'gr2' from 
3UTRsByTranscript().

gr1 <- GRanges(c("chr1", "chr1"), IRanges(c(1, 10), width=10))
gr2 <- GRanges("chr1", IRanges(25, width=10))
grl <- GRangesList(gr1, gr2)
reads <- GAlignments("chr1", pos=18L, cigar="10M", strand=strand("*"))

The read hits gr1 and gr2. If counted separately,

 > assays(summarizeOverlaps(grl[1], reads, Union))$counts
      reads
[1,]     1
 > assays(summarizeOverlaps(grl[2], reads, Union))$counts
      reads
[1,]     1

Now count as GRangesList,

 > assays(summarizeOverlaps(grl, reads, Union))$counts
      reads
[1,]     0
[2,]     0

Using 'inter.feature=FALSE' will regain the hits with modes Union and 
IntersectionStrict but not necessarily IntersectionNotEmpty because 
shared regions of the annotation are removed before counting.

 > assays(summarizeOverlaps(grl, reads, Union, inter.feature=FALSE))$counts
      reads
[1,]     1
[2,]     1

>
> (2) Similarly, what if I want to generate counts for multiple counting
> modes in a single call to summarizeOverlaps, such as different overlap
> modes, or sense and antisense counts which we often want to report when
> working with strand specific RNA-Seq data? Are there plans to support
> this? BTW: currently, I generate antisense read counts by inversing the
> strand information of the features which works fine just not for both
> ways in one step.

We don't have a built-in method to count Bams over multiple annotations 
and multiple count modes. There are several approaches you could take. 
After some discussion this was one we came up with.

     ## Count a single file.
     count1bam <-
         function(file, features, mode, ...,
                  yieldSize=1000000L, reader=readGAlignments,
                  param=ScanBamParam())
     {
         ## initialize
         bf <- open(BamFile(file, yieldSize=yieldSize))
         counts <- lapply(sapply(features, length), integer)

         ## iterate
         while(length(reads <- reader(bf, param=param)))
             for (i in seq_along(features))
                 counts[[i]] <-
                     counts[[i]] + mode(features[[i]], reads, ...)
         close(bf)
         counts
     }

     ## Count over files in parallel.
     countbams <-
         function(features, files, ...)
     {
         counts0 <- mclapply(files, count1bam, features=features, ...)
         counts <- lapply(names(features), function(i, counts) {
             sapply(counts0, "[[", i)
         }, counts0)
     }

     ## In practice.
     library(parallel); options(mc.cores=detectCores())
     library(GenomicRanges)
     library(Rsamtools)
     features <-
         list(coarse=GRanges("seq1", successiveIRanges(rep(100, 15))),
              fine=GRanges("seq2", successiveIRanges(rep(20, 45))))
     fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
     files <- list(fl, fl)  ## usually different files

     ## As 'mode' you can directly call Union, IntersectionStrict,
     ## and IntersectionNotEmpty.
     result <- countbams(features, files, mode=Union,
                         ignore.strand=TRUE, inter.feature=FALSE)

     ## 'reader' specifies how the data are read in. For paired-end
     ## use readGAlignmentsList (qname sorted or readGAlignmentPairs.
     result <- countbams(features, files, mode=Union,
                         reader=readGAlignmentPairs)

This code handles the case of multiple Bam with multiple annotations. To 
handle combinations for multiple arguments ('mode', 'ignore.strand', 
'inter.feature', etc.) you could add a Map() in the 'while' loop in 
count1bam. Instead of passing a single value you would pass lists the 
same length as the list of annotations. For example,

modeList <- c(Union, IntersectionStrict)
ignoreStrandList <- c(TRUE, TRUE)
interFeatureList <- c(TRUE, FALSE)

>
> (3) Is it currently possible to return the total numbers of aligned
> reads in bam files with summarizeOverlaps and store them in the
> resulting counting object? There are many places from where the numbers
> of aligned reads can be obtained, but to me the most obvious step to
> generate and store them would be right here. They are useful parameters
> for alignment QC'ing and reporting proper RPKM/FPKM values to biologists.
>

I've added the output of countBam() to the 'colData' slot of the 
SummarizedExperiment. The 'mapped' counts are computed with countBam() 
and a param where 'isUnmappedQuery=FALSE'. This isn't a great example 
but does demonstrate the new output.

library(Rsamtools)
library(pasillaBamSubset)
library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
se <- summarizeOverlaps(exbygene, BamFile(untreated1_chr4()))

 > colData(se)
DataFrame with 1 row and 3 columns
                       records nucleotides    mapped
                     <integer>   <numeric> <integer>
untreated1_chr4.bam    204355    15326625    204355

Code updates are in GenomicRanges 1.13.15 and Rsamtools 1.13.16.

Valerie

> Thomas
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>