[BioC] multiple feature/mode counting with summarizeOverlaps
Valerie Obenchain
vobencha at fhcrc.org
Thu May 23 23:06:10 CEST 2013
Hi Thomas,
On 05/20/2013 10:12 PM, Thomas Girke wrote:
> I apologize up front to Valarie for posting additional questions related
> to her excellent read counter. This time my questions are related to the
> economics of accessing big files as few times as possible to improve
> workflow performance in large RNA-Seq projects and to subsequently allow
> deleting all input bam files since BIG DATA is killing us :).
>
> (1) What is currently the intended approach to obtain counts for
> multiple feature types in a single pass-through of one or many bam
> files? For instance, often we want to count the reads overlapping with
> many different features types (not just one), such as exonBygenes,
> cdsBygenes, UTRs, introns and intergenics. Computing all those feature
> counts sequentially in a loop is easy, but can be time consuming and
> heavy on disk I/O (the most expensive resource in NGS analysis) and
> internal networks when running many counting processes in parallel. As a
> partial solution to this, I often organize all feature types in one big
> GRanges/GRangesList container, perform the counting and then split the
> resulting count tables accordingly. However, is this really the best way
> of doing this? A major limitation of this approach is that it does not
> support feature overlap aware counting modes.
I would not recommend combining different annotations into a single
GRanges/GRangesList. To summarizeOverlaps(), each row of a GRanges or
each list element of a GRangesList represents a feature. If a read hits
more than one feature we have a double hit situation that must be
resolved. If you're interested in counting both exonsByGene and
transcriptsByGene you probably want these to be independent counts. By
combining them into one object they can be confounding and create
multiple hits situations where there really aren't any.
Toy example, but let's say 'gr1' comes from exonsBy() and 'gr2' from
3UTRsByTranscript().
gr1 <- GRanges(c("chr1", "chr1"), IRanges(c(1, 10), width=10))
gr2 <- GRanges("chr1", IRanges(25, width=10))
grl <- GRangesList(gr1, gr2)
reads <- GAlignments("chr1", pos=18L, cigar="10M", strand=strand("*"))
The read hits gr1 and gr2. If counted separately,
> assays(summarizeOverlaps(grl[1], reads, Union))$counts
reads
[1,] 1
> assays(summarizeOverlaps(grl[2], reads, Union))$counts
reads
[1,] 1
Now count as GRangesList,
> assays(summarizeOverlaps(grl, reads, Union))$counts
reads
[1,] 0
[2,] 0
Using 'inter.feature=FALSE' will regain the hits with modes Union and
IntersectionStrict but not necessarily IntersectionNotEmpty because
shared regions of the annotation are removed before counting.
> assays(summarizeOverlaps(grl, reads, Union, inter.feature=FALSE))$counts
reads
[1,] 1
[2,] 1
>
> (2) Similarly, what if I want to generate counts for multiple counting
> modes in a single call to summarizeOverlaps, such as different overlap
> modes, or sense and antisense counts which we often want to report when
> working with strand specific RNA-Seq data? Are there plans to support
> this? BTW: currently, I generate antisense read counts by inversing the
> strand information of the features which works fine just not for both
> ways in one step.
We don't have a built-in method to count Bams over multiple annotations
and multiple count modes. There are several approaches you could take.
After some discussion this was one we came up with.
## Count a single file.
count1bam <-
function(file, features, mode, ...,
yieldSize=1000000L, reader=readGAlignments,
param=ScanBamParam())
{
## initialize
bf <- open(BamFile(file, yieldSize=yieldSize))
counts <- lapply(sapply(features, length), integer)
## iterate
while(length(reads <- reader(bf, param=param)))
for (i in seq_along(features))
counts[[i]] <-
counts[[i]] + mode(features[[i]], reads, ...)
close(bf)
counts
}
## Count over files in parallel.
countbams <-
function(features, files, ...)
{
counts0 <- mclapply(files, count1bam, features=features, ...)
counts <- lapply(names(features), function(i, counts) {
sapply(counts0, "[[", i)
}, counts0)
}
## In practice.
library(parallel); options(mc.cores=detectCores())
library(GenomicRanges)
library(Rsamtools)
features <-
list(coarse=GRanges("seq1", successiveIRanges(rep(100, 15))),
fine=GRanges("seq2", successiveIRanges(rep(20, 45))))
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
files <- list(fl, fl) ## usually different files
## As 'mode' you can directly call Union, IntersectionStrict,
## and IntersectionNotEmpty.
result <- countbams(features, files, mode=Union,
ignore.strand=TRUE, inter.feature=FALSE)
## 'reader' specifies how the data are read in. For paired-end
## use readGAlignmentsList (qname sorted or readGAlignmentPairs.
result <- countbams(features, files, mode=Union,
reader=readGAlignmentPairs)
This code handles the case of multiple Bam with multiple annotations. To
handle combinations for multiple arguments ('mode', 'ignore.strand',
'inter.feature', etc.) you could add a Map() in the 'while' loop in
count1bam. Instead of passing a single value you would pass lists the
same length as the list of annotations. For example,
modeList <- c(Union, IntersectionStrict)
ignoreStrandList <- c(TRUE, TRUE)
interFeatureList <- c(TRUE, FALSE)
>
> (3) Is it currently possible to return the total numbers of aligned
> reads in bam files with summarizeOverlaps and store them in the
> resulting counting object? There are many places from where the numbers
> of aligned reads can be obtained, but to me the most obvious step to
> generate and store them would be right here. They are useful parameters
> for alignment QC'ing and reporting proper RPKM/FPKM values to biologists.
>
I've added the output of countBam() to the 'colData' slot of the
SummarizedExperiment. The 'mapped' counts are computed with countBam()
and a param where 'isUnmappedQuery=FALSE'. This isn't a great example
but does demonstrate the new output.
library(Rsamtools)
library(pasillaBamSubset)
library("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
se <- summarizeOverlaps(exbygene, BamFile(untreated1_chr4()))
> colData(se)
DataFrame with 1 row and 3 columns
records nucleotides mapped
<integer> <numeric> <integer>
untreated1_chr4.bam 204355 15326625 204355
Code updates are in GenomicRanges 1.13.15 and Rsamtools 1.13.16.
Valerie
> Thomas
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list