[BioC] Recommended gene model for DESeq
Valerie Obenchain
vobencha at fhcrc.org
Thu Apr 5 21:36:24 CEST 2012
On 04/05/2012 02:11 AM, Simon Anders wrote:
> Hi Gordon
>
> On 04/04/2012 10:30 PM, Assaf Gordon wrote:
>> I've read previous discussions about transcript vs. gene level [1]
>> and exon level considerations [2] but perhaps I've missed the bottom
>> line: Is it OK to have multiple isoforms per gene (and treat each
>> transcript as "gene record", which will result in some
>> double-counting of reads), or do I need to pre-process the gene model
>> file, to ensure there are no overlaps (e.g. by merging all isoforms
>> of a single gene) ? Or, is some post-processing needed to the DESeq
>> results (from nbinomTest()) to "normalize" genes with multiple
>> isoforms?
>
> You should make sure that each read is counted only once per gene.
> Hence, counting per transcript and the summing up the counts for the
> transcripts from the same gene is not a good idea.
>
> I use my htseq-count script (part of HTSeq,
> http://www-huber.embl.de/users/anders/HTSeq/). This Python script
> takes a GTF file and a SAM file and output a count table per gene. To
> this end, it finds all the exons with which the read overlaps and then
> checks whether they all have the same Gene ID. If so, it counts the
> read for this gene.
>
> If you want to stay in R: Valerie Obenchain has recently added
> functionality to GenomicRanges to perform counting in a way similar to
> my htseq-count script, and described this in the vignette
> 'countGenomicOverlaps.pdf' (in the GenomicRanges package).
We renamed the vignette to 'Overview of summarizeOverlaps'. Also see
?summarizeOverlaps.
Valerie
> Also look at Nico Delhomme's new package 'easyRNASeq' which is also
> meant to facilitate this task.
>
>
> A subtle but very annoying problem is this:
>
> According to the GTF file standard specs, an "exon" or "CDS" line in a
> GTF file shlould have two attributes labelled "gene_id" and
> "transcript_id". Exon lines describing different transcripts of the
> same gene should hence have different transcript IDs but the same gene
> ID. Ensembl honors this rule while GTF files downloaded from the UCSC
> Table browser just have the transcript ID repeated as gene ID, which
> defeats the purpose, of course. A script won't notice that two
> transcripts belong to the same gene because they do not have the same
> gene ID.
>
> Possible fixes: (i) Use GTF files from Ensembl. (ii) Ask the UCSC
> people to fix the issue. (iii) Fix a UCSC GTF file by removing from
> the gene_id attibute the number behind the last dot. Hence, taking
> e.g., this line from the UCSC GTF file for hg19:
>
> chr1 hg19_knownGene exon 16858 17055 0.000000 -
> . gene_id "uc009vjd.2"; transcript_id "uc009vjd.2";
>
> change it to this
>
> chr1 hg19_knownGene exon 16858 17055 0.000000 -
> . gene_id "uc009vjd"; transcript_id "uc009vjd.2";
>
>
> This change should be easy to make on the fly by any of the three
> methods mentioned above. At least mine does not offer this
> functionality, but I should add it, I suppose. Valerie and Nico, how
> do your functions deal with this?
>
>
> Simon
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list