[BioC] Recommended gene model for DESeq

Assaf Gordon gordon at cshl.edu
Thu Apr 5 19:00:47 CEST 2012


Thank you for the quick answer.

A side-note regarding UCSC's GTF export:

Simon Anders wrote, On 04/05/2012 05:11 AM: 
>
> A subtle but very annoying problem is this:
> 
> According to the GTF file standard specs, an "exon" or "CDS" line in
> a GTF file shlould have two attributes labelled "gene_id" and
> "transcript_id". Exon lines describing different transcripts of the
> same gene should hence have different transcript IDs but the same
> gene ID. Ensembl honors this rule while GTF files downloaded from the
> UCSC Table browser just have the transcript ID repeated as gene ID,
> which defeats the purpose, of course. A script won't notice that two
> transcripts belong to the same gene because they do not have the same
> gene ID.
> 
> Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC
> people to fix the issue. (iii) Fix a UCSC GTF file by removing from
> the gene_id attibute the number behind the last dot. Hence, taking
> e.g., this line from the UCSC GTF file for hg19:
> 

The UCSC Table Browser indeed produces unsuitable GTF files, it's a known issue and they're not going to fix it (technically: the format conversion code uses a BED-like structure for all formats, and BED has only one ID per record).

The correct way to export usable GFF files directly from UCSC:

1. Download a program called "genePredToGtf" from here:
      http://hgdownload.cse.ucsc.edu/admin/exe/
   or compile it from source ( http://genome.ucsc.edu/admin/git.html ) if you're feeling lucky.

2. Create the following file in your home directory:
  $ cat ~/.hg.conf 
  db.host=genome-mysql.cse.ucsc.edu
  db.user=genomep
  db.password=password
  # the file's permissions must be user-only
  $ chmod 0600 ~/.hg.conf

3. run "genePredToGtf" with any organism and any table that is in "genePred" format:
  ## mm9/RefSeq Genes
  $ genePredToGtf mm9 refGene refGene.gtf

  ## hg19/Ensemble Genes
  $ genePredToGtf hg19 ensGene ensGene.gtf

  ## hg18/UCSC Known Genes
  $ genePredToGtf hg18 knownGene knownGene.gtf

This will save "refGene.gtf" with all the required attributes (gene_id, gene_name transcript_id, exon_number).
(but still, per your explanation, not directly usable with DESeq because of multiple isoforms per gene).

-gordon



More information about the Bioconductor mailing list