[BioC] Recommended gene model for DESeq
Simon Anders
anders at embl.de
Thu Apr 5 11:11:26 CEST 2012
Hi Gordon
On 04/04/2012 10:30 PM, Assaf Gordon wrote:
> I've read previous discussions about transcript vs. gene level [1]
> and exon level considerations [2] but perhaps I've missed the bottom
> line: Is it OK to have multiple isoforms per gene (and treat each
> transcript as "gene record", which will result in some
> double-counting of reads), or do I need to pre-process the gene model
> file, to ensure there are no overlaps (e.g. by merging all isoforms
> of a single gene) ? Or, is some post-processing needed to the DESeq
> results (from nbinomTest()) to "normalize" genes with multiple
> isoforms?
You should make sure that each read is counted only once per gene.
Hence, counting per transcript and the summing up the counts for the
transcripts from the same gene is not a good idea.
I use my htseq-count script (part of HTSeq,
http://www-huber.embl.de/users/anders/HTSeq/). This Python script takes
a GTF file and a SAM file and output a count table per gene. To this
end, it finds all the exons with which the read overlaps and then checks
whether they all have the same Gene ID. If so, it counts the read for
this gene.
If you want to stay in R: Valerie Obenchain has recently added
functionality to GenomicRanges to perform counting in a way similar to
my htseq-count script, and described this in the vignette
'countGenomicOverlaps.pdf' (in the GenomicRanges package). Also look at
Nico Delhomme's new package 'easyRNASeq' which is also meant to
facilitate this task.
A subtle but very annoying problem is this:
According to the GTF file standard specs, an "exon" or "CDS" line in a
GTF file shlould have two attributes labelled "gene_id" and
"transcript_id". Exon lines describing different transcripts of the same
gene should hence have different transcript IDs but the same gene ID.
Ensembl honors this rule while GTF files downloaded from the UCSC Table
browser just have the transcript ID repeated as gene ID, which defeats
the purpose, of course. A script won't notice that two transcripts
belong to the same gene because they do not have the same gene ID.
Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC people
to fix the issue. (iii) Fix a UCSC GTF file by removing from the gene_id
attibute the number behind the last dot. Hence, taking e.g., this line
from the UCSC GTF file for hg19:
chr1 hg19_knownGene exon 16858 17055 0.000000 - . gene_id "uc009vjd.2";
transcript_id "uc009vjd.2";
change it to this
chr1 hg19_knownGene exon 16858 17055 0.000000 - . gene_id "uc009vjd";
transcript_id "uc009vjd.2";
This change should be easy to make on the fly by any of the three
methods mentioned above. At least mine does not offer this
functionality, but I should add it, I suppose. Valerie and Nico, how do
your functions deal with this?
Simon
More information about the Bioconductor
mailing list