[BioC] Recommended gene model for DESeq

Simon Anders anders at embl.de
Thu Apr 5 11:11:26 CEST 2012


Hi Gordon

On 04/04/2012 10:30 PM, Assaf Gordon wrote:
> I've read previous discussions about transcript vs. gene level [1]
> and exon level considerations [2] but perhaps I've missed the bottom
> line: Is it OK to have multiple isoforms per gene (and treat each
> transcript as "gene record", which will result in some
> double-counting of reads), or do I need to pre-process the gene model
> file, to ensure there are no overlaps (e.g. by merging all isoforms
> of a single gene) ? Or, is some post-processing needed to the DESeq
> results (from nbinomTest()) to "normalize" genes with multiple
> isoforms?

You should make sure that each read is counted only once per gene. 
Hence, counting per transcript and the summing up the counts for the 
transcripts from the same gene is not a good idea.

I use my htseq-count script (part of HTSeq, 
http://www-huber.embl.de/users/anders/HTSeq/). This Python script takes 
a GTF file and a SAM file and output a count table per gene. To this 
end, it finds all the exons with which the read overlaps and then checks 
whether they all have the same Gene ID. If so, it counts the read for 
this gene.

If you want to stay in R: Valerie Obenchain has recently added 
functionality to GenomicRanges to perform counting in a way similar to 
my htseq-count script, and described this in the vignette 
'countGenomicOverlaps.pdf' (in the GenomicRanges package). Also look at 
Nico Delhomme's new package 'easyRNASeq' which is also meant to 
facilitate this task.


A subtle but very annoying problem is this:

According to the GTF file standard specs, an "exon" or "CDS" line in a 
GTF file shlould have two attributes labelled "gene_id" and 
"transcript_id". Exon lines describing different transcripts of the same 
gene should hence have different transcript IDs but the same gene ID. 
Ensembl honors this rule while GTF files downloaded from the UCSC Table 
browser just have the transcript ID repeated as gene ID, which defeats 
the purpose, of course. A script won't notice that two transcripts 
belong to the same gene because they do not have the same gene ID.

Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC people 
to fix the issue. (iii) Fix a UCSC GTF file by removing from the gene_id 
attibute the number behind the last dot. Hence, taking e.g., this line 
from the UCSC GTF file for hg19:

chr1	hg19_knownGene	exon	16858	17055	0.000000	-	.	gene_id "uc009vjd.2"; 
transcript_id "uc009vjd.2";

change it to this

chr1	hg19_knownGene	exon	16858	17055	0.000000	-	.	gene_id "uc009vjd"; 
transcript_id "uc009vjd.2";


This change should be easy to make on the fly by any of the three 
methods mentioned above. At least mine does not offer this 
functionality, but I should add it, I suppose. Valerie and Nico, how do 
your functions deal with this?


   Simon



More information about the Bioconductor mailing list