[BioC] Recommended gene model for DESeq

Thu Apr 5 21:36:24 CEST 2012

On 04/05/2012 02:11 AM, Simon Anders wrote:
> Hi Gordon
>
> On 04/04/2012 10:30 PM, Assaf Gordon wrote:
>> I've read previous discussions about transcript vs. gene level [1]
>> and exon level considerations [2] but perhaps I've missed the bottom
>> line: Is it OK to have multiple isoforms per gene (and treat each
>> transcript as "gene record", which will result in some
>> double-counting of reads), or do I need to pre-process the gene model
>> file, to ensure there are no overlaps (e.g. by merging all isoforms
>> of a single gene) ? Or, is some post-processing needed to the DESeq
>> results (from nbinomTest()) to "normalize" genes with multiple
>> isoforms?
>
> You should make sure that each read is counted only once per gene. 
> Hence, counting per transcript and the summing up the counts for the 
> transcripts from the same gene is not a good idea.
>
> I use my htseq-count script (part of HTSeq, 
> http://www-huber.embl.de/users/anders/HTSeq/). This Python script 
> takes a GTF file and a SAM file and output a count table per gene. To 
> this end, it finds all the exons with which the read overlaps and then 
> checks whether they all have the same Gene ID. If so, it counts the 
> read for this gene.
>
> If you want to stay in R: Valerie Obenchain has recently added 
> functionality to GenomicRanges to perform counting in a way similar to 
> my htseq-count script, and described this in the vignette 
> 'countGenomicOverlaps.pdf' (in the GenomicRanges package). 

We renamed the vignette to 'Overview of summarizeOverlaps'. Also see 
?summarizeOverlaps.

Valerie
> Also look at Nico Delhomme's new package 'easyRNASeq' which is also 
> meant to facilitate this task.
>
>
> A subtle but very annoying problem is this:
>
> According to the GTF file standard specs, an "exon" or "CDS" line in a 
> GTF file shlould have two attributes labelled "gene_id" and 
> "transcript_id". Exon lines describing different transcripts of the 
> same gene should hence have different transcript IDs but the same gene 
> ID. Ensembl honors this rule while GTF files downloaded from the UCSC 
> Table browser just have the transcript ID repeated as gene ID, which 
> defeats the purpose, of course. A script won't notice that two 
> transcripts belong to the same gene because they do not have the same 
> gene ID.
>
> Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC 
> people to fix the issue. (iii) Fix a UCSC GTF file by removing from 
> the gene_id attibute the number behind the last dot. Hence, taking 
> e.g., this line from the UCSC GTF file for hg19:
>
> chr1    hg19_knownGene    exon    16858    17055    0.000000    -    
> .    gene_id "uc009vjd.2"; transcript_id "uc009vjd.2";
>
> change it to this
>
> chr1    hg19_knownGene    exon    16858    17055    0.000000    -    
> .    gene_id "uc009vjd"; transcript_id "uc009vjd.2";
>
>
> This change should be easy to make on the fly by any of the three 
> methods mentioned above. At least mine does not offer this 
> functionality, but I should add it, I suppose. Valerie and Nico, how 
> do your functions deal with this?
>
>
>   Simon
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor