[BioC] multiple feature/mode counting with summarizeOverlaps

Valerie Obenchain vobencha at fhcrc.org
Fri May 24 01:34:14 CEST 2013


Thanks Thomas. I appreciate the thoughts and the back-story details. 
These are helpful in shaping future decisions.

Valerie

On 05/23/2013 03:39 PM, Thomas Girke wrote:
> Thanks Varlerie for looking into these suggestions and improvements. I
> really appreciate the time and effort you keep investing into my
> unsolicited questions. - Obviously, read counting has become a major
> activity for many of us and your software will definitley put to great
> use. In our projects, flexiblity and accuracy of the read counter is
> most important to us, which summarizeOverlaps and associates address
> very well. Since research facilities like ours have to perform those
> analyses on very large disk storage devices with several hundred TBs of
> space that are shared among many NGS users, disk I/O and network speeds
> on compute clusters have become a major bottleneck. Thus, we are also
> very interested in solutions that collect counts for many feature types
> and counting modes in a single pass-through of these big files, as
> opposed to accessing them over and over again when it is not really
> necessary. Usually, I also like to delete the bam files early on in most
> of our RNA-Seq workflows to save storage space. However, our
> collaborators often keep coming up with all kinds of creative and
> exciting new ideas what features/modes to count on where the best
> analysis strategy is often to generate by default counts for many
> feature types right away (even if I don't intend and recommend or to use
> them) rather than only one prescibed solution that usually ends up not
> being enough. I guess RNA-Seq is still very much in an engineering stage
> where most of us depend on software design that provides a very high
> level of flexiblitly.
>
> Again, thanks a lot.
>
> Thomas
>
>
>
> On Thu, May 23, 2013 at 09:06:10PM +0000, Valerie Obenchain wrote:
>> Hi Thomas,
>>
>> On 05/20/2013 10:12 PM, Thomas Girke wrote:
>>> I apologize up front to Valarie for posting additional questions related
>>> to her excellent read counter. This time my questions are related to the
>>> economics of accessing big files as few times as possible to improve
>>> workflow performance in large RNA-Seq projects and to subsequently allow
>>> deleting all input bam files since BIG DATA is killing us :).
>>>
>>> (1) What is currently the intended approach to obtain counts for
>>> multiple feature types in a single pass-through of one or many bam
>>> files?  For instance, often we want to count the reads overlapping with
>>> many different features types (not just one), such as exonBygenes,
>>> cdsBygenes, UTRs, introns and intergenics. Computing all those feature
>>> counts sequentially in a loop is easy, but can be time consuming and
>>> heavy on disk I/O (the most expensive resource in NGS analysis) and
>>> internal networks when running many counting processes in parallel. As a
>>> partial solution to this, I often organize all feature types in one big
>>> GRanges/GRangesList container, perform the counting and then split the
>>> resulting count tables accordingly. However, is this really the best way
>>> of doing this?  A major limitation of this approach is that it does not
>>> support feature overlap aware counting modes.
>>
>> I would not recommend combining different annotations into a single
>> GRanges/GRangesList. To summarizeOverlaps(), each row of a GRanges or
>> each list element of a GRangesList represents a feature. If a read hits
>> more than one feature we have a double hit situation that must be
>> resolved. If you're interested in counting both exonsByGene and
>> transcriptsByGene you probably want these to be independent counts. By
>> combining them into one object they can be confounding and create
>> multiple hits situations where there really aren't any.
>>
>> Toy example, but let's say 'gr1' comes from exonsBy() and 'gr2' from
>> 3UTRsByTranscript().
>>
>> gr1 <- GRanges(c("chr1", "chr1"), IRanges(c(1, 10), width=10))
>> gr2 <- GRanges("chr1", IRanges(25, width=10))
>> grl <- GRangesList(gr1, gr2)
>> reads <- GAlignments("chr1", pos=18L, cigar="10M", strand=strand("*"))
>>
>>
>> The read hits gr1 and gr2. If counted separately,
>>
>>   > assays(summarizeOverlaps(grl[1], reads, Union))$counts
>>        reads
>> [1,]     1
>>   > assays(summarizeOverlaps(grl[2], reads, Union))$counts
>>        reads
>> [1,]     1
>>
>> Now count as GRangesList,
>>
>>   > assays(summarizeOverlaps(grl, reads, Union))$counts
>>        reads
>> [1,]     0
>> [2,]     0
>>
>>
>> Using 'inter.feature=FALSE' will regain the hits with modes Union and
>> IntersectionStrict but not necessarily IntersectionNotEmpty because
>> shared regions of the annotation are removed before counting.
>>
>>   > assays(summarizeOverlaps(grl, reads, Union, inter.feature=FALSE))$counts
>>        reads
>> [1,]     1
>> [2,]     1
>>
>>>
>>> (2) Similarly, what if I want to generate counts for multiple counting
>>> modes in a single call to summarizeOverlaps, such as different overlap
>>> modes, or sense and antisense counts which we often want to report when
>>> working with strand specific RNA-Seq data? Are there plans to support
>>> this? BTW: currently, I generate antisense read counts by inversing the
>>> strand information of the features which works fine just not for both
>>> ways in one step.
>>
>> We don't have a built-in method to count Bams over multiple annotations
>> and multiple count modes. There are several approaches you could take.
>> After some discussion this was one we came up with.
>>
>>       ## Count a single file.
>>       count1bam <-
>>           function(file, features, mode, ...,
>>                    yieldSize=1000000L, reader=readGAlignments,
>>                    param=ScanBamParam())
>>       {
>>           ## initialize
>>           bf <- open(BamFile(file, yieldSize=yieldSize))
>>           counts <- lapply(sapply(features, length), integer)
>>
>>           ## iterate
>>           while(length(reads <- reader(bf, param=param)))
>>               for (i in seq_along(features))
>>                   counts[[i]] <-
>>                       counts[[i]] + mode(features[[i]], reads, ...)
>>           close(bf)
>>           counts
>>       }
>>
>>       ## Count over files in parallel.
>>       countbams <-
>>           function(features, files, ...)
>>       {
>>           counts0 <- mclapply(files, count1bam, features=features, ...)
>>           counts <- lapply(names(features), function(i, counts) {
>>               sapply(counts0, "[[", i)
>>           }, counts0)
>>       }
>>
>>       ## In practice.
>>       library(parallel); options(mc.cores=detectCores())
>>       library(GenomicRanges)
>>       library(Rsamtools)
>>       features <-
>>           list(coarse=GRanges("seq1", successiveIRanges(rep(100, 15))),
>>                fine=GRanges("seq2", successiveIRanges(rep(20, 45))))
>>       fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>>       files <- list(fl, fl)  ## usually different files
>>
>>       ## As 'mode' you can directly call Union, IntersectionStrict,
>>       ## and IntersectionNotEmpty.
>>       result <- countbams(features, files, mode=Union,
>>                           ignore.strand=TRUE, inter.feature=FALSE)
>>
>>       ## 'reader' specifies how the data are read in. For paired-end
>>       ## use readGAlignmentsList (qname sorted or readGAlignmentPairs.
>>       result <- countbams(features, files, mode=Union,
>>                           reader=readGAlignmentPairs)
>>
>> This code handles the case of multiple Bam with multiple annotations. To
>> handle combinations for multiple arguments ('mode', 'ignore.strand',
>> 'inter.feature', etc.) you could add a Map() in the 'while' loop in
>> count1bam. Instead of passing a single value you would pass lists the
>> same length as the list of annotations. For example,
>>
>> modeList <- c(Union, IntersectionStrict)
>> ignoreStrandList <- c(TRUE, TRUE)
>> interFeatureList <- c(TRUE, FALSE)
>>
>>>
>>> (3) Is it currently possible to return the total numbers of aligned
>>> reads in bam files with summarizeOverlaps and store them in the
>>> resulting counting object? There are many places from where the numbers
>>> of aligned reads can be obtained, but to me the most obvious step to
>>> generate and store them would be right here. They are useful parameters
>>> for alignment QC'ing and reporting proper RPKM/FPKM values to biologists.
>>>
>>
>> I've added the output of countBam() to the 'colData' slot of the
>> SummarizedExperiment. The 'mapped' counts are computed with countBam()
>> and a param where 'isUnmappedQuery=FALSE'. This isn't a great example
>> but does demonstrate the new output.
>>
>> library(Rsamtools)
>> library(pasillaBamSubset)
>> library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
>> exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
>> se <- summarizeOverlaps(exbygene, BamFile(untreated1_chr4()))
>>
>>   > colData(se)
>> DataFrame with 1 row and 3 columns
>>                         records nucleotides    mapped
>>                       <integer>   <numeric> <integer>
>> untreated1_chr4.bam    204355    15326625    204355
>>
>>
>>
>> Code updates are in GenomicRanges 1.13.15 and Rsamtools 1.13.16.
>>
>> Valerie
>>
>>> Thomas
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>



More information about the Bioconductor mailing list