[Bioc-devel] makeTranscriptDBFromGFF v. Flybase GFF

Marc Carlson mcarlson at fhcrc.org
Tue Feb 12 01:32:02 CET 2013


Hi Malcolm,

I don't have much to add that hasn't been mentioned before, so I bet 
many people can safely skip this one.

GFF and GTF files share many of the same issues.  Both are used today 
for things (like transcriptomes) that represent a pretty specific use 
case, yet both formats were designed a while ago, so some kinds of 
information (like exon rank) that are crucial for building a 
transcriptome are still optional when writing a GFF or GTF file.  Also, 
because the formats are very flexible and general in their 
specification, a file can be either overly sparse OR overly loaded with 
unnecessary stuff (depending on what you plan to use it for).  So it is 
entirely possible that the Ensembl file is smaller and yet still 
contains what you need.  Or it might not be smaller.  You will simply 
have to check it and see how it compares.

If you are using my function makeTranscriptDBFromGFF() from the 
GenomicFeatures package, it will try to check whether the file has all 
the required information as it processes it into a TranscriptDb 
object.  The only thing you really have to be "extra careful" about is 
the exon rank attribute.  The function can "guess" at that information 
for you, but I am betting you don't want that if you can avoid it 
(which is why you get a warning when this happens).  So for these data, 
you really want to point it to an attribute that holds that information 
(if one exists).
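
To make "guessing" the exon rank concrete, here is a rough Python 
sketch of one plausible way to infer ranks from coordinates alone.  It 
is only an illustration under my own assumptions (exons numbered from 
the 5' end, one strand per transcript), not necessarily what the 
function does internally, and infer_exon_ranks is a made-up name.

```python
from collections import defaultdict

def infer_exon_ranks(exons):
    """Guess exon ranks from coordinates (illustrative only).

    exons: iterable of (transcript_id, strand, start, end) tuples.
    Returns {transcript_id: [(rank, start, end), ...]}.
    """
    by_tx = defaultdict(list)
    for tx_id, strand, start, end in exons:
        by_tx[tx_id].append((strand, start, end))

    ranks = {}
    for tx_id, items in by_tx.items():
        # Assumption: every exon of a transcript is on the same strand.
        strand = items[0][0]
        # Rank 1 is the 5'-most exon: lowest start on '+', highest on '-'.
        ordered = sorted(((s, e) for _, s, e in items),
                         reverse=(strand == "-"))
        ranks[tx_id] = [(i, s, e) for i, (s, e) in enumerate(ordered, start=1)]
    return ranks
```

A guess like this is only as good as its assumptions, which is why an 
explicit rank attribute in the file is preferable when one is available.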

Besides files with too much or too little information, you will also 
sometimes see a file formatted in some peculiar way that requires you 
to translate it into a more typical-looking GFF or GTF file.  This 
happens because, as I mentioned above, the formats are fairly general 
and open to some interpretation by those who write them out.  In 
general I think the most important piece of advice is to always look 
at GFF or GTF files yourself before you try to use them, because you 
can't really be sure what kind of information is in there unless you do.
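
If eyeballing a large file is impractical, a quick tally can help.  
The Python sketch below counts feature types (column 3) and attribute 
keys (column 9) in a GFF3 file.  It assumes GFF3-style key=value 
attributes (GTF attributes are written differently), stops at a 
##FASTA directive if one is present, and uses a function name, 
summarize_gff3, that I made up for illustration.

```python
from collections import Counter

def summarize_gff3(lines):
    """Tally feature types and attribute keys from GFF3 lines."""
    feature_types = Counter()
    attribute_keys = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        feature_types[cols[2]] += 1
        for pair in cols[8].split(";"):
            key, sep, _ = pair.strip().partition("=")
            if sep:
                attribute_keys[key] += 1
    return feature_types, attribute_keys
```

You would call it with an open file handle, e.g. 
summarize_gff3(open("my_annotations.gff")), where the file name is a 
placeholder.  The two tallies show at a glance which feature types 
dominate the file and whether any attribute looks like it carries exon 
rank.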

The bottom line is that both Ensembl and FlyBase are reputable places to 
get data from.  But because they are different places, they may produce 
dramatically different-looking GFF or GTF files.


On a related note, please be sure to use the very latest version of 
makeTranscriptDBFromGFF from the devel branch, as I have made some 
performance improvements since the release.


I hope this helps,



   Marc




On 02/11/2013 03:13 PM, Cook, Malcolm wrote:
> Marc et al.,
>
> A colleague of mine (cc:ed) is experiencing memory bloat using makeTranscriptDBFromGFF on the dmel GFF from Flybase.org.
>
> I told him of my success in using Ensembl's GTF-ization but that I would check in with you (et al).
>
> So....
>
> Do you have any advice/warnings/gotchas/toldyasos/caveats re: applying makeTranscriptDBFromGFF to FlyBase GFF?
>
> Thanks!
>
> Cheers,
>
> Malcolm
>


