[BioC] Recommended gene model for DESeq

Nicolas Delhomme delhomme at embl.de
Sat Apr 7 12:36:53 CEST 2012


On 6 Apr 2012, at 23:24, Assaf Gordon wrote:

> Thank you all for your responses.
> 
> I'm still looking for the optimal way to count hits for RNA-Seq Paired-end data,  may I ask for couple of clarifications?
> 
> Simon Anders wrote, On 04/05/2012 05:11 AM:
>> You should make sure that each read is counted only once per gene.
> 
> Once per gene - got it.
> What about a case where a read matches multiple genes? (described as "ambiguous" in HTSeq-Count/GenomicRanges "modes")
> Is it OK to count this read several times (once for each gene, multiple different genes), or would that invalidate the results?
> 
> It seems "easyRNASeq" will count a read multiple times (once per gene) when using "geneModels" summarization mode (based on [1], page 10) - so can it be used?
> 

That's only TRUE if the user does not pay attention to the warnings easyRNASeq is emitting. easyRNASeq creates geneModels per strand, so indeed genes overlapping on opposite strand my end up in double counting of reads. I do not want easyRNASeq to be a "black box", I want the user to understand what (s)he's doing and what (s)he's looking at. This is why I have many sanity checks that return warnings and guide the user. I can perfectly imagine that in a given tissue it is known that one gene is expressed whereas the gene on the other strand is not and that the user wants to keep the full geneModels for that first gene. What is still missing from the documentation (vignette) and that I need to add is how to rework the obtained gene model(s). I'm actually thinking of making the geneModel generation more flexible for the user, i.e. returning an annotated structure with synthetic exons (exonic regions as unique as possible, similar to what HTSeq does, i.e. including the complementary strand overlaps) and a set of methods that the user could apply to decide what (s)he's more interested in.

Another word of caution here: I do not check the unicity of the reads in the bam file. I expect the user to know enough about the aligner (s)he is using for his/her application to decide on that. From the literature it seems that people only retains uniquely mapped reads (for most applications). I could add a sanity check for that. 


> 
>> If you want to stay in R: Valerie Obenchain has recently added functionality to GenomicRanges to perform counting in a way similar to my htseq-count script
> 
> Related question: handling paired-end data correctly.
> It seems only GenomicRanges does not handle paired-end reads at all (based on [2], page 2, section "3. counting mode") - so the only option is "htseq-count" - is that correct ?
> I also couldn't find any mention of paired-end data in "easyRNASeq" PDF [1], so I don't know if it handles that or not.

I thought I wrote it in the vignette (I'll double check that). It can't at the moment but will soon; I'm working on it.

So at the moment you might be better set using htseq-count, but keep an eye open ;-)

Cheers,

Nico

> 
> 
> Thanks,
> -gordon
> 
> 
> 
> [1] http://bioconductor.org/packages/release/bioc/vignettes/easyRNASeq/inst/doc/easyRNASeq.pdf
> [2] http://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps-modes.pdf
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list