[BioC] package to count read prior to DESeq?

Fri Feb 3 21:13:15 CET 2012

Hi Nico

Thanks for a quick and detailed reply. I am a tiny bit confused, may
be my question was not clear. Here is how I understand it. Please
correct me if I got it wrong.

What I would like to do is study differential gene expression at both
gene and transcript level.  As an input what I have is transcriptome
model in a gtf file and mapped bam files(unique hits only: I have
removed reads mapping to multiple positions in the genome). The gtf
file lists all the possible transcripts/exons for a gene.

What I need to get from this file two types of counts.
        1)  counts of reads that fall in each gene bin (strand specific)
        2)  count of reads that fall in each transcript (again strand sp)

What I am not sure is the counting method of easyRNASeq. Given an
annotation of gene with 2 transcripts (A, B) how does it do the
counting. If it is counting at gene level then at some point in the
calculation it should be merging the exons from transcripts A and B to
gene level.  And if it is counting at transcript level then we should
get two counts one for each transcript. I understand in the latter
case there will be double counting due to overlapping exons between
the transcripts.

It will help a lot if you could clear this.

PS: I also did try to use htseq-count for doing the same but the
counts I got dont quite match the avg number of reads we expect to see
/ gene or transcript. I guess I should start another thread on this
with Simon. May be I am doing something wrong.

Thanks!
-Abhi

On Fri, Feb 3, 2012 at 11:48 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
> Dear Abhi,
>
> Yes, we can certainly tweak it to do so, but I'm not sure how much sense that would make.
>
> First, by doing so would in end effect result in counting read several times; i.e. reads involved in the same exons of different transcripts will be accounted for several time, so you would bias your data.
>
> Second, easyRNASeq is meant to start with an unmodified bam file, i.e. after any quality pre-processing has been done but before any kind of normalization/summarization. What easyRNASeq does is to compute the number of reads per feature (exon, transcript, gene,...) and normalize these results (or not) into RPKM or by using the DESeq or edgeR packages. Both these packages require "raw" unmodified data.
>
> Transforming transcript "counts" (or FPKM) into genes count is definitely not a straightforward process. You would need to be able to deconvolute every transcript effect, e.g. for a gene with two transcript A and B, you would need to establish the proportion of the different effect from the part unique to transcript A (the part not overlapping transcript B), that of transcript B and their common part. Knowing that Illumina has a significant (and somewhat hard to predict) read coverage bias along any transcript (luckily that bias is reproducible...), it makes it IMO hardly possible to achieve.
>
> If you are interested in gene counts, I would suggest to use easyRNASeq on its own, starting from your unmodified bam files. Then depending on your experiment, e.g. if you're doing some kind of differential expression analysis, apply DESeq, baySeq or edgeR.
>
> I hope this helps,
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On 3 Feb 2012, at 19:51, Abhishek Pratap wrote:
>
>> Hi Nico
>>
>> With respect to easyRNAeq package I am wondering if I supply it a
>> transcript level gtf file can it produce gene level counts. Basically
>> the file is the output from cufflinks.
>>
>> the last(attributes) column in the file does have the gene_id attribute.
>>
>>
>> Best,
>> -Abhi
>>
>>
>>
>> On Fri, Feb 3, 2012 at 8:06 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
>>> Hi Nat,
>>>
>>> That's what the easyRNASeq (Bioc 2.10) pacakge exactly is for. Let me know if you need any help.
>>>
>>> Cheers,
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Genome Biology Computational Support
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>> On 3 Feb 2012, at 16:39, nathalie wrote:
>>>
>>>> Hi,
>>>> Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis?
>>>> thanks
>>>> Nat
>>>>
>>>>
>>>> --
>>>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>