[BioC] makeTranscriptDbFromGFF Error for UCSC GTF File
Marc Carlson
mcarlson at fhcrc.org
Wed Jul 2 21:59:11 CEST 2014
OK,
It looks like Herve may have found another problem with the code that
tries to guess the order of the exon ranking when it is not provided by
these files. I will look into that.
But for now I think that people should instead do this (which works
right now):
library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC(genome="hg19",tablename="refGene")
And another thing that relates to this is that I really don't think you
want to use the GTF files being generated here for making TranscriptDb
objects.
The reason is because Simon is correct about the GTF file coming from
UCSC here. It's not a good one. It may also have the problem that
Simon describes with the bad IDs, but it definitely also has another
serious problem that traces back to a deficiency in the GTF file format
for this use case (which is: representing a transcriptome).
The problem is that you can make a valid GTF file and not include any
exon ranking information. That means that the file can tell you about
ranges for all the exons but not ever tell you what order they should be
in to make a corresponding transcript. And that's not good as it is
still woefully incomplete for specifying a transcriptome. Now for some
really simple organisms that have very little splicing this might be
OK. But for humans it is almost certainly not OK which is why my code
is throwing all those warnings in your output.
This absence basically means that you then have to guess that the exons
occur in the order that they occur along a chromosome. Now sometimes
you can tell that it is not OK to make this kind of guess even with the
limited data that you have - such as when exons are from other
chromosomes - and these kinds of cases get thrown out - under protest:
with even more warnings. But in any case, my feeling is that you should
probably never use the arguments to make that kind of guess when dealing
with something that has a transcriptome as complex as humans. So if you
have a GTF file and it's for a complex organism then I really think you
should insist that that file contains exon rankings.
This is what I was alluding to when I said that the GTF file format has
a use case that is more general than that of storing a complete
transcriptome. Can you use the format to store a legimate
transcriptome? Yes of course.
But can you rely on the file format to ensure that a valid file will
always contains a complete transcriptome? Unfortunately: the answer is
no. Some GTF files will just not have the information needed to specify
a transcriptome since not all the information that you need to specify
that is required by the format.
Hope this clarifies things,
Marc
On 07/02/2014 11:34 AM, Simon Anders wrote:
> Hi
>
> On 02/07/14 20:17, Michael Lawrence wrote:
>>> In contrast, using GTF or GFF files for making TranscriptDb objects is
>>> always a little risky because many of these files will not have been
>>> created with the intention of holding a transcriptome as data (which
>>> is the
>>> specific thing that a TranscriptDb object is meant to hold). This is
>>> because the GTF and GFF file formats were not initially intended for
>>> the
>>> specific purpose of holding a transcriptome but were instead
>>> intended to be
>>> something more general.
>>>
>>>
>> Actually GTF (Gene Transfer Format) files are designed specifically for
>> representing gene models, and we have no excuse for not parsing them
>> correctly. There have been some tweaks to attribute parsing (I thought
>> limited to GFF3) in devel, so there may be a difference between Herve's
>> devel result and Dario's release result. I'll try to find some time to
>> look into this.
>
> The problem with GTF files produced by the UCSC Table Browser is that
> they contain incorrect gene IDs: The gene_id attribute is always set
> to the same value as the transcript_id, and these files hence cannot
> be used to define gene models without manual correction (which would
> be to remove the transcript number suffix from the gene IDs).
>
> Long ago, I have asked the UCSC Genome Browser help-desk why this is
> and was told that it is a bug in the Table Browser which they cannot
> fix, for some reason.
>
> Hence, I usually advise to not use these files.
>
> Simon
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list