[Bioc-devel] makeTranscriptDBFromGFF v. Flybase GFF
mcarlson at fhcrc.org
Tue Feb 12 01:32:02 CET 2013
I don't have much to add that hasn't been mentioned before, so I bet
many people can skip this one.
GFF and GTF files raise many of the same issues when you use them. Both
are being used today for things (like transcriptomes) that represent a
pretty specific use case. And because both file formats were designed a
while ago, some kinds of information (like exon rank) that are crucial
for something like a transcriptome are still optional when making a GFF
or GTF file. Also, because these file formats are very flexible and
general in their specification, a file can be either overly sparse OR
overly loaded with unnecessary stuff (depending on what you were
planning to use it for). So it is entirely possible that the Ensembl
file is smaller and yet still contains what you need. Or it might not
be smaller. You will simply have to check it and see how it compares.
If you are using my function makeTranscriptDBFromGFF() from the
GenomicFeatures package, it will try to check whether the file has all
the required information as it processes it into a TranscriptDb object.
When you call it, the only thing you really have to be "extra careful"
about is the exon rank attribute. The function can "guess" that
information for you, but I am betting you don't want that if you can
avoid it (which is why you will get a warning if this happens). So for
these data, you really want to point to an attribute that has that
information (if that is possible).
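To make the exon rank issue concrete, here is a minimal Python sketch of
what "guessing" exon rank from coordinates generally looks like. It is
NOT the GenomicFeatures implementation; the function name
infer_exon_ranks and the tuple layout are made up for illustration. The
idea is simply to order each transcript's exons by position along the
strand and number them from 1.

```python
from collections import defaultdict

def infer_exon_ranks(exons):
    """Guess exon ranks from coordinates (illustrative helper only).

    exons: iterable of (transcript_id, strand, start, end) tuples.
    Returns a dict mapping (transcript_id, start, end) -> rank, where
    rank 1 is the most 5' exon of that transcript.
    """
    by_transcript = defaultdict(list)
    for tx_id, strand, start, end in exons:
        by_transcript[tx_id].append((strand, start, end))

    ranks = {}
    for tx_id, parts in by_transcript.items():
        # Assumption: all exons of a transcript lie on one strand.
        strand = parts[0][0]
        coords = [(start, end) for _, start, end in parts]
        # Plus strand: ascending coordinates; minus strand: descending.
        ordered = sorted(coords, reverse=(strand == "-"))
        for rank, (start, end) in enumerate(ordered, start=1):
            ranks[(tx_id, start, end)] = rank
    return ranks

example = [
    ("tx1", "+", 100, 200), ("tx1", "+", 300, 400),
    ("tx2", "-", 100, 200), ("tx2", "-", 300, 400),
]
ranks = infer_exon_ranks(example)
```

A guess like this rests on the single-strand, colinear assumption noted
in the comment, so it can be wrong for unusual transcripts (trans-spliced
ones, for example). That is the reason to prefer an explicit rank
attribute when the file provides one.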
Besides files that have too much or too little information, you will
also sometimes see a file that is formatted in some peculiar way and
has to be translated into a more typical looking GFF or GTF file first.
This happens because, as I mentioned above, the file formats are fairly
general and open to some interpretation by those who write them out. I
think the most important piece of advice is that you should always look
at GFF or GTF files yourself before you try to use them, because you
can't really be sure what kind of information is in there unless you
do. The bottom line is that both Ensembl and FlyBase are reputable
places to get data from. But because they are different places, they
may produce dramatically different looking GFF or GTF files.
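As one way of looking at a file before using it, the following Python
sketch tallies the feature types (column 3) and the attribute keys
(column 9) it contains. The helper names attribute_key and summarize_gff
are hypothetical, and the parsing is deliberately simple: it assumes
tab-separated 9-column records, handles both GFF3-style key=value and
GTF-style key "value" attributes, and does not cope with semicolons
inside quoted values.

```python
from collections import Counter

def attribute_key(part):
    """Return the key of one attribute, GFF3 (k=v) or GTF (k "v") style."""
    eq = part.find("=")
    sp = part.find(" ")
    if eq != -1 and (sp == -1 or eq < sp):
        return part[:eq]           # GFF3 style: key=value
    return part.split(None, 1)[0]  # GTF style: key "value"

def summarize_gff(lines):
    """Count feature types and attribute keys in GFF/GTF lines."""
    feature_types = Counter()
    attribute_keys = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # GFF3 files may end with an embedded FASTA section
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a standard feature record
        feature_types[cols[2]] += 1
        for part in cols[8].split(";"):
            part = part.strip()
            if part:
                attribute_keys[attribute_key(part)] += 1
    return feature_types, attribute_keys

# Tiny made-up sample mixing one GFF3-style and one GTF-style record.
sample = [
    "##gff-version 3",
    "2L\tFlyBase\texon\t100\t200\t.\t+\t.\tID=e1;Parent=t1",
    '2L\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "g1"; exon_number "1";',
    "##FASTA",
    ">2L",
    "ACGT",
]
types, keys = summarize_gff(sample)
```

For a real file you would pass an open file handle instead of the sample
list. If the attribute key tally shows something that looks like an exon
rank (Ensembl GTF files, for instance, carry an exon_number attribute),
that is the attribute worth pointing the import at; if nothing like it
shows up, expect the rank to be guessed.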
Also related to this, please be sure to use the very latest version of
makeTranscriptDBFromGFF from the devel branch, as I have made some
improvements for performance since the release.
I hope this helps,
On 02/11/2013 03:13 PM, Cook, Malcolm wrote:
> Marc et al.,
> A colleague of mine (cc:ed) is experiencing memory bloat using makeTranscriptDBFromGFF on dmel GFF from Flybase.org
> I told him of my success in using Ensembl's GTF-ization but that I would check in with you (et al).
> Do you have any advice/warnings/gotchas/toldyasos/caveats re: applying makeTranscriptDBFromGFF to Flybase