[Bioc-devel] [BioC] counting RNA-seq reads in R/BioC

Tue Sep 11 13:33:48 CEST 2012

Hi Mark,

Thanks a lot for that detailed analysis. That's something I've been wanting to do for a long time, just never got the time to.

On 11 Sep 2012, at 12:48, Mark Robinson wrote:

> Hi all,
> 
> Sorry for cross-posting, but I figured this might be of interest to both developers and users.
> 
> My student (Xiaobei Zhou) and I have been playing around with the "simple" task of counting RNA-seq reads in Bioconductor and we've collected a few observations.  My apologies in advance if I've made errors in calling the software or if I've omitted other methods to try.  Please correct us if so.
> 
> Here is a (hopefully) concise summary of what we've found:
> 
> 1. There are now various options in Bioconductor to go from (mapped) RNA-seq reads to counts at the gene, transcript, exon level.  The main players, as we understand it, are: easyRNASeq, summarizeOverlaps() in GenomicRanges and featureCounts() in Rsubread.  Outside of Bioconductor, we have used htseq-count from Simon Anders and treat this as the standard, since it's fairly mature.  For reference, we've collected code segments for all methods below, in case anyone wants to repeat this, or needs this for their data (of course, as a resource, the mailing list is not the best place to capture this).
> 
> 2. For single-end reads, the methods are quite similar, although differences exist.  I attribute this to differences in how reads are matched to features (e.g. overhanging reads) and/or filtering.  Nailing down the exact details are a bit hard to figure out, but I expect these to be minor overall.  Here is an example: http://imlspenticton.uzh.ch/se.png 

The slight difference that can be observed using easyRNASeq comes from starting from a bp coverage summarization rather than a read count summarization. There's a slim explanation of that in the RnaSeqTutorial vignette.

> 
> 
> 3. For paired-end reads, there is (or at least can be) a distinct divergence in read counts: http://imlspenticton.uzh.ch/pe.png ... at present, easyRNASeq and featureCounts()/Rsubread "overcount" reads and summarizeOverlap() "undercounts", relative to htseq-count.

easyRNASeq will indeed overcount reads depending on the annotation. I've done (I think) a detailed explanation of this in the easyRNASeq vignette and the package emits a number of warnings to report such cases when they occur. Due to the large annotation choices and the uncountable amount of new genomes, it is difficult to have a single gold standard method to clean (i.e. render an annotation robust so as to avoid double counting) the annotation. It is on my TODO list to add in the vignette a description on how to get an annotation transformed in a way similar to what is done by htseq. I've been discussing this with Simon extensively in the past, and I really just miss the time to write it down. As a side note, this is a project that would be really suited for the new website feature that Paul Shannon mentioned recently.

> 
> Wei Shi (Rsubread author) has recently mentioned that this is the way he likes to count:
> https://stat.ethz.ch/pipermail/bioconductor/2012-June/046440.html
> 

I've haven't check that but will.

> I've spoken with Nico Delhomme (easyRNASeq author) offline and accurately treating mate reads is on his TODO list.

That's correct. Same time issue as before.

> 
> The undercounting in summarizeOverlaps(), as far as we can tell, is due to filtering singleton reads (one read of the mate is mapped, the other is not) and not counting them.
> 
> 4. I have some results on memory usage and CPU time.  The packages available have different strategies to manage the tradeoff, so they vary widely.

Can you post these ones too, that would be very interesting.

> 
> 
> A couple comments / suggestions:
> 
> 1. From a user perspective, it would be optimal to have all options available: count singletons or not, overhang or no overhang, counting a pair as a single fragment, etc.

I agree. I've been wanting to contact Bioconductor (from the mailing list emails that would be Valerie's domain, I suppose?) for consolidating this. I would be willing to spend time with Wei Shi and Valerie(?) to achieve that. Another point I think would be very valuable here is that these methods when applied to a set of files would return a SummarizedExperiments object (SummarizedExperiments are meant to be contained for NGS data, similar to the ExpressionSet used for microarray). I've added the support for SummarizedExperiment recently to easyRNASeq. In my opinion it would be great if more downstream analyses such as edgeR, DESeq, DEXSeq,... could use such object as input and ideally as output.

> 
> 2. Another level of flexibility that would be useful is counting on junctions.  I think the infrastructure for this is now available: 
> https://stat.ethz.ch/pipermail/bioconductor/2012-May/045506.html
> Is there anyone working on a wrapper that just takes the reads (or pointer to BAM) and features and adds junction-level counts to the exon-level ones?
> 

If by junction level reads, you're meaning reads that span exon-exon junction (EEJ), easyRNASeq (using GenomicRanges methods) applied to BAM files which reads have been aligned using aligners that offer gap alignments and are EEJ aware (tophat, RSEM, GSNAP,...), takes such reads into account. But I'm not sure I'm talking about the same you are. 

> 
> Please discuss.

I'm looking forward to discussing this. I would propose that we first take this discussion offline to detail our different approaches. Hopefully, this will help us get to a consensus of the approach(es) to be undertaken. This could then be posted on the mailing list for getting feedback from the community.

> 
> These are observations, not criticisms.  

And they are great and long deserved. Thanks for that.

> I sincerely thank the contributors for their excellent packages; it's very nice to have multiple options available.
> 

Best,

Nico

> 
> Best regards,
> Mark
> 
> 
> 
> 
> # Do not expect this to run.  Populate a TXT file 
> # listing all the filenames (BAM/SAM)
> 
> # Given: metadata with file names
> # starting from GTF files and BAM/SAM, derive tables of counts
> sample.inform <- read.table("samples_bam_sam_etc.txt", header = TRUE, stringsAsFactors = FALSE)
> 
> ## -------------------
> ## easyRNASeq
> ## -------------------
> 
> library(easyRNASeq)
> library(BSgenome.Dmelanogaster.UCSC.dm3)
> 
> gtf <- "Drosophila_melanogaster.BDGP5.67.gtf"
> bamFlsR <- sample.inform[, "bamfile_resort"]
> 
> c_ers <- easyRNASeq(organism = "Dmelanogaster", 
>                    chr.sizes = as.list(seqlengths(Dmelanogaster)),
>                    annotationMethod = "gtf", annotationFile= gtf, 
>                    format = "bam",gapped = TRUE, count = "genes", 
>                    summarization = "geneModels", 
>                    filenames = bamFlsR, filesDirectory = dir)
> 
> 
> ## -------------------
> ## countOverlaps
> ## -------------------
> 
> library(GenomicFeatures)
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> 
> txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
> gnModel <- exonsBy(txdb, "gene")
> bamFls <- paste(sample.inform[, "bamfile_s"], sep = "")
> 
> 
> UCSC2FlybaseGRanges<- function (GRanges) {
> ### Rename the chromosomes in<GRanges>  from UCSC conventions ('chr1',
> ### etc) to comport with Flybase conventions ('1', etc) by stripping
> ### leading 'chr' and translating 'M' as 'dmel_mitochondrion_genome'.
> seqlevels(GRanges)<- sub('chr','',seqlevels(GRanges))
> seqlevels(GRanges)<- sub('M','dmel_mitochondrion_genome',seqlevels(GRanges))
> GRanges
> }
> 
> gnModelT <- UCSC2FlybaseGRanges(gnModel)
> 
> counter <- function(fl, gnModel)
> {
>    aln <- readGappedAlignments(fl)
>    strand(aln) <- "*" # for strand-blind sample prep protocol
>    hits <- countOverlaps(aln, gnModel)
>    counts <- countOverlaps(gnModel, aln[hits==1])
>    names(counts) <- names(gnModel)
>    counts
> }
> c_conOver <- sapply(bamFls, counter, gnModel = gnModelT)
> 
> ## -------------------
> ## summarizeOverlaps using UCSC annotation
> ## -------------------
> 
> library(GenomicRanges)
> library(Rsamtools)
> library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
> library(GenomicFeatures)
> bamFlsR <- sample.inform[, "bamfile_resort"]
> # separate paired and unpaired
> bamFlsR_pe <- bamFlsR[sample.inform[, "libtype"] == "PE"]
> bamFlsR_se <- bamFlsR[sample.inform[, "libtype"] == "SE"]
> bamFlsR_peL <- BamFileList(bamFlsR_pe)
> bamFlsR_seL <- BamFileList(bamFlsR_se)
> txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
> gnModel <- exonsBy(txdb, "gene")
> 
> 
> UCSC2FlybaseGRanges<- function (GRanges) {
> ### Rename the chromosomes in<GRanges>  from UCSC conventions ('chr1',
> ### etc) to comport with Flybase conventions ('1', etc) by stripping
> ### leading 'chr' and translating 'M' as 'dmel_mitochondrion_genome'.
> seqlevels(GRanges)<- sub('chr','',seqlevels(GRanges))
> seqlevels(GRanges)<- sub('M','dmel_mitochondrion_genome',seqlevels(GRanges))
> GRanges
> }
> 
> gnModelT <- UCSC2FlybaseGRanges(gnModel)
> con_pe <- summarizeOverlaps(gnModelT, bamFlsR_peL, 
>                            mode = "Union", singleEnd=FALSE, ignore.strand = TRUE)
> con_se <- summarizeOverlaps(gnModelT, bamFlsR_seL, 
>                            mode = "Union", singleEnd=TRUE, ignore.strand = TRUE)
> c_summOver <- cbind(assays(con_pe)$counts, assays(con_se)$counts)
> colnames(count_summOver) <- c(basename(bamFlsR_pe), basename(bamFlsR_se))
> 
> 
> ## -------------------
> ## Rsubread
> ## -------------------
> 
> library(Rsubread)
> samFls <- sample.inform[, "samfile"]
> 
> gtf <- "Drosophila_melanogaster.BDGP5.67.gtf"
> z <- read.table(gtf, sep="\t", stringsAsFactors=FALSE)
> z <- z[z$V3=="exon",]
> ids <- gsub(" gene_id ","",sapply(strsplit(z$V9,";"),.subset,1))
> anno <- data.frame(entrezid=as.integer(factor(ids)), 
>                   chromosome=z$V1, chr_start=z$V4, chr_stop=z$V5)
> 
> ID <- sort(unique(anno$entrezid))
> genID <- ids[match(ID, anno$entrezid)]
> 
> c_rsub <- featureCounts(SAMfiles=samFls, annot=anno, type = "gene")$counts
> rownames(count_rsub) <-genID
> colnames(count_rsub) <-basename(samFls)
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ----------
> Prof. Dr. Mark Robinson
> Bioinformatics
> Institute of Molecular Life Sciences
> University of Zurich
> Winterthurerstrasse 190
> 8057 Zurich
> Switzerland
> 
> v: +41 44 635 4848
> f: +41 44 635 6898
> e: mark.robinson at imls.uzh.ch
> o: Y11-J-16
> w: http://tiny.cc/mrobin
> 
> ----------
> http://www.fgcz.ch/Bioconductor2012
> http://www.eccb12.org/t5
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor