[Bioc-devel] [BioC] GTF file error when using easyRNAseq

Sat Nov 16 23:34:10 CET 2013

Hej Martin!

That is indeed extremely useful. I’ll add that to my vignette as a source for annotation. One thing that comes immediately to mind is to have AnnotationHub and GenomicFeatures interact. I could imagine it would be useful to have e.g. a makeTranscriptDbFromAnnotationHub function. I admit I haven’t though about this in detail, but I already can think of use cases where such a function would come in handy.

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On 15 Nov 2013, at 20:13, Martin Morgan <mtmorgan at fhcrc.org> wrote:

> On 11/15/2013 10:22 AM, Michael Lawrence wrote:
>> Doesn't look like genomeIntervals has any C code (?), so a performance
>> comparison would be interesting. rtracklayer jumps through all sorts of
>> hoops to handle obscure things like URL encoding in GFF3. The code in
>> genomeIntervals seems more streamlined.
> 
> Wanted to mention, and it would be good to know if this was not helpful at all, that the Ensembl gtf files are available through AnnotationHub as GRanges objects
> 
> > library(AnnotationHub)
> > hub = AnnotationHub()
> > hub$ensembl.release.73.<tab>
> hub$ensembl.release.73.fasta. ... [378]
> hub$ensembl.release.73.gtf. ... [63]
> > xx = hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData
> > xx
> GRanges with 381368 ranges and 12 metadata columns:
>                 seqnames       ranges strand   |         source        type
>                    <Rle>    <IRanges>  <Rle>   |       <factor>    <factor>
>       [1]              1 [1735, 2449]      +   | protein_coding        exon
>       [2]              1 [2379, 2449]      +   | protein_coding         CDS
>               score     phase            gene_id      transcript_id
>           <numeric> <integer>        <character>        <character>
>       [1]      <NA>      <NA> ENSGALG00000009771 ENSGALT00000015891
>       [2]      <NA>         0 ENSGALG00000009771 ENSGALT00000015891
>           exon_number   gene_biotype            exon_id         protein_id
>             <numeric>    <character>        <character>        <character>
>       [1]           1 protein_coding ENSGALE00000301221               <NA>
>       [2]           1 protein_coding               <NA> ENSGALP00000015874
>                gene_name    transcript_name
>              <character>        <character>
>       [1]           <NA>               <NA>
>       [2]           <NA>               <NA>
> [ reached getOption("max.print") -- omitted 9 rows ]
>  ---
>  seqlengths:
>                    1                  2 ...     AADN03010940.1
>                   NA                 NA ...                 NA
> 
> Martin
> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, Nov 15, 2013 at 10:14 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
>> 
>>> Took that thread to the devel list, just feels more appropriate with
>>> regards to the content.
>>> 
>>> I already have that on my TODO list :-). This is not up-to-date, i.e. I
>>> haven’t done the comparison in ~2 years, but last time I did,
>>> genomeIntervals attribute parsing was faster than rtracklayer equivalent. I
>>> suppose that’s because it is already implemented in C in genomeIntervals.
>>> As said I don’t have any actual comparative numbers, still you might want
>>> to have a look at the genomeIntervals code. As I don’t think that
>>> genomeIntervals get as much exposition as rtracklayer does, many more
>>> people would benefit from an equivalent rtracklayer implementation. If
>>> you’re interested, I could do a performance comparison - based on my usual
>>> use case - between both packages.
>>> 
>>> Nico
>>> 
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>> 
>>> Genome Biology Computational Support
>>> 
>>> European Molecular Biology Laboratory
>>> 
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 15 Nov 2013, at 18:58, Michael Lawrence <lawrence.michael at gene.com>
>>> wrote:
>>> 
>>>> It might be worth taking a look at rtracklayer and the TranscriptDb
>>> stuff in GenomicFeatures. It could save you time, and if you notice any
>>> deficiencies in rtracklayer, it would help me. For example, if the
>>> attribute parsing is a bottleneck, I can push it down to C.
>>>> 
>>>> Michael
>>>> 
>>>> On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delhomme at embl.de>
>>> wrote:
>>>> Hej Michael,
>>>> 
>>>> Good question really. I have a number of reason for this:
>>>> 
>>>> 1) I’ve been using the genomeIntervals readGff3 function for that - for
>>> years now - and I’ve always been satisfied by its performance, especially
>>> when parsing the gff/gtf ninth column. The parseGffAttribute and
>>> getGffAttribute functions are extremely convenient. I honestly haven’t
>>> checked if there was any recent development in rtracklayer /
>>> GenomicFeatures similar to these functions. If there were not, I think they
>>> would be a great addition to either package.
>>>> 
>>>> 2) As you might guess it’s essentially historical, back when I started
>>> that package in 2009, there was not today’s fantastic set of packages.
>>>> 
>>>> 3) As you painfully know, there’s about as many gff format as they are
>>> gff files, and because my package is a pipeline I really want to make sure
>>> that it’s output is consistent, hence I have strict requirement with
>>> regards to the gff/gtf format I accept. Which means that times and again, I
>>> have to do slight adjustment but I prefer that over outputting garbage.
>>>> 
>>>> 4) RNA-Seq analyses are filled with pitfalls, hence I think it is
>>> essential that users understand the data formats they handle and actually
>>> what these analyses are all about. I don’t want them to use my package as
>>> they would use a black box.
>>>> 
>>>> 5) It’s educational. There’s a vignette that describes how to parse and
>>> convert gff/gtf annotation in the minimal gff/gtf formatted file that would
>>> suit my package
>>>> 
>>>> Well, I suppose it’s more than you asked for, but here are my reasons
>>> ;-) You’re welcome to comment and I’d be happy to look again at rtracklayer
>>> (been through GenomicFeatures recently and I like it much) if you would
>>> advise me so.
>>>> 
>>>> Have a nice WE,
>>>> 
>>>> Cheers,
>>>> 
>>>> Nico
>>>> 
>>>> 
>>>> ---------------------------------------------------------------
>>>> Nicolas Delhomme
>>>> 
>>>> Genome Biology Computational Support
>>>> 
>>>> European Molecular Biology Laboratory
>>>> 
>>>> Tel: +49 6221 387 8310
>>>> Email: nicolas.delhomme at embl.de
>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>> 69102 Heidelberg, Germany
>>>> ---------------------------------------------------------------
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.michael at gene.com>
>>> wrote:
>>>> 
>>>>> Why not use rtracklayer / GenomicFeatures for parsing GTF? That format
>>> is tough; no reason for everyone to take it on by themselves.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delhomme at embl.de>
>>> wrote:
>>>>> Hej Natalia!
>>>>> 
>>>>> There were a number of lines in that particular gtf that violated the
>>> assumptions I had about EnsEMBL gtf. Not all the fields in the attributes'
>>> column were always set and one of the gene name had a space character in
>>> it. I’ve made the parsing of gtf file annotation more flexible/lenient and
>>> that should resolve that particular issue you had. The changes should
>>> propagate in ~2 days to Bioc with easyRNASeq version 1.8.2.
>>>>> 
>>>>> Rather than using the geneModel, which implementation is old and has
>>> gotten slow because of changes in the underlying architecture, I prefer an
>>> approach where I
>>>>> 1) filter the gtf / gff annotation file for only those lines I’m
>>> interested in (e.g. of type exon, mRNA and gene for a gff file)
>>>>> 2) collapse every exon of a gene into what I call now a “synthetic
>>> transcript”. The reason for changing the naming from geneModel to synthetic
>>> transcript is that “gene model” has different meaning depending on the
>>> field of application.
>>>>> 3) use easyRNASeq with the modified annotation and the “transcript”
>>> count method.
>>>>> 
>>>>> I’ve detailed that procedure in the vignette section 7.1 .
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Nico
>>>>> 
>>>>> ---------------------------------------------------------------
>>>>> Nicolas Delhomme
>>>>> 
>>>>> Genome Biology Computational Support
>>>>> 
>>>>> European Molecular Biology Laboratory
>>>>> 
>>>>> Tel: +49 6221 387 8310
>>>>> Email: nicolas.delhomme at embl.de
>>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>>> 69102 Heidelberg, Germany
>>>>> ---------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On 12 Nov 2013, at 14:21, Nicolas Delhomme <delhomme at embl.de> wrote:
>>>>> 
>>>>>> Hej Natalia!
>>>>>> 
>>>>>> This is not the first time that I’ve seen this error on the list,
>>> but I’ve not been able to reproduce it so far with my own data. Would you
>>> mind sharing some data offline, just an excerpt of your files would do. If
>>> that’s OK, I can create and give access to a folder on my box account.
>>>>>> 
>>>>>> I had already relaxed the constraint on parsing a gtf file in a
>>> previous update but forgot to reflect the changes in the error message.
>>> Only the  gene_id and transcript_id are actually mandatory. I would not
>>> expect any issue with the EnsEMBL gtf file, but I’ll have a look at why it
>>> fails for Gallus galls one and let you know asap.
>>>>>> 
>>>>>> This as nothing to do with this error, but by looking at your
>>> command line, I saw that you provide a character vector to the chrSize
>>> argument. This is not necessary as this information is extracted from the
>>> bam file in your case, so you can just drop the chrSizes = “chrSizes” from
>>> your command line. I’ve added some extra check in the method to detect this
>>> now. Thanks :-)
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Nico
>>>>>> 
>>>>>> ---------------------------------------------------------------
>>>>>> Nicolas Delhomme
>>>>>> 
>>>>>> Genome Biology Computational Support
>>>>>> 
>>>>>> European Molecular Biology Laboratory
>>>>>> 
>>>>>> Tel: +49 6221 387 8310
>>>>>> Email: nicolas.delhomme at embl.de
>>>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>>>> 69102 Heidelberg, Germany
>>>>>> ---------------------------------------------------------------
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 11 Nov 2013, at 16:31, Natalia [guest] <guest at bioconductor.org>
>>> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Dear all,
>>>>>>> I would like to make a count table to use it in DESeq. I´ve tried
>>> to use easyRNAseq but I have a problem with the annotation file. I’ve
>>> downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run
>>> into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot
>>> is empty, I modified the file and added chr before the chromosome number.
>>> The next problem was this:
>>>>>>> 
>>>>>>> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all
>>> the required fields: gene_id, transcript_id, exon_number, gene_name.
>>>>>>> 
>>>>>>> To solve this problem:
>>>>>>> - I deleted all the entries without gene_name (first example):
>>>>>>> 
>>>>>>> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891";
>>> exon_number "1"; gene_biotype "protein_coding"; exon_id
>>> "ENSGALE00000301221";
>>>>>>> 
>>>>>>> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914";
>>> exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding";
>>> transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891";
>>>>>>> 
>>>>>>> - I checked the chromosome numbers and deleted the entries that
>>> didn’t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can’t
>>> find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf
>>> file, I don’t know if it is a problem):
>>>>>>> 
>>>>>>> - I searched for semicolons and single quotes ‘ in the gene names,
>>> but I didn’t find any on the final file.
>>>>>>> 
>>>>>>> - I deleted all the columns after gene_name.
>>>>>>> 
>>>>>>> So finally the annotation file entries look like this:
>>>>>>> chr1 protein_coding  exon    19962541        19963992
>>>>>>>      .       +       .       gene_id "ENSGALG00000000003";
>>> transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2";
>>>>>>> 
>>>>>>> Nothing works; the error message is always the same. So, I don’t
>>> know what else I can do. Could you please help me?
>>>>>>> Thank you in advance!
>>>>>>> Cheers
>>>>>>> 
>>>>>>> Natalia
>>>>>>> 
>>>>>>> 
>>>>>>> here is my code:
>>>>>>>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus",
>>> chrSizes="chrSizes", annotationMethod="gtf",
>>> annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes",
>>> summarization="geneModels", format="bam", gapped=TRUE,
>>> filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq",
>>> conditions=conditions)
>>>>>>> Checking arguments...
>>>>>>> Fetching annotations...
>>>>>>> Read 334620 records
>>>>>>> Error en .getGtfRange(organismName(obj), filename = filename,
>>> ignoreWarnings = ignoreWarnings,  :
>>>>>>> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all
>>> the required fields: gene_id, transcript_id, exon_number, gene_name.
>>>>>>> Además: Mensajes de aviso perdidos
>>>>>>> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =
>>> "chrSizes",  :
>>>>>>> Your organism has no mapping defined to perform the validity check
>>> for the UCSC compliance of the chromosome name.
>>>>>>> Defined organism's mapping can be listed using the 'knownOrganisms'
>>> function.
>>>>>>> To benefit from the validity check, you can provide a 'chr.map' to
>>> your 'easyRNASeq' function call.
>>>>>>> As you did not do so, 'validity.check' is turned off
>>>>>>> 2: In .Method(..., deparse.level = deparse.level) :
>>>>>>> number of columns of result is not a multiple of vector length (arg
>>> 1)
>>>>>>> 
>>>>>>>> traceback()
>>>>>>> 6: stop(paste("Your gtf file: ", filename, " does not contain all
>>> the required fields: ",
>>>>>>>    paste(fields, collapse = ", "), ".", sep = ""))
>>>>>>> 5: .getGtfRange(organismName(obj), filename = filename,
>>> ignoreWarnings = ignoreWarnings,
>>>>>>>    ...)
>>>>>>> 4: fetchAnnotation(obj, method = annotationMethod, filename =
>>> annotationFile,
>>>>>>>    ignoreWarnings = ignoreWarnings, ...)
>>>>>>> 3: fetchAnnotation(obj, method = annotationMethod, filename =
>>> annotationFile,
>>>>>>>    ignoreWarnings = ignoreWarnings, ...)
>>>>>>> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =
>>> "chrSizes",
>>>>>>>    annotationMethod = "gtf", annotationFile = "
>>> Gallus_gallus.Galgal4.73.gtf ",
>>>>>>>    count = "genes", summarization = "geneModels", format = "bam",
>>>>>>>    gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"),
>>>>>>>    outputFormat = "DESeq", conditions = conditions)
>>>>>>> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =
>>> "chrSizes",
>>>>>>>    annotationMethod = "gtf", annotationFile = "
>>> Gallus_gallus.Galgal4.73.gtf ",
>>>>>>>    count = "genes", summarization = "geneModels", format = "bam",
>>>>>>>    gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"),
>>>>>>>    outputFormat = "DESeq", conditions = conditions)
>>>>>>> 
>>>>>>> 
>>>>>>> -- output of sessionInfo():
>>>>>>> 
>>>>>>>> sessionInfo()
>>>>>>> R version 3.0.2 (2013-09-25)
>>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>> 
>>>>>>> locale:
>>>>>>> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252
>>>  LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
>>> LC_TIME=Spanish_Spain.1252
>>>>>>> 
>>>>>>> attached base packages:
>>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>  methods   base
>>>>>>> 
>>>>>>> other attached packages:
>>>>>>> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0
>>>            easyRNASeq_1.8.1                     ShortRead_1.20.0
>>>>>>> [5] Rsamtools_1.14.1                     GenomicRanges_1.14.3
>>>           DESeq_1.14.0                         lattice_0.20-23
>>>>>>> [9] locfit_1.5-9.1                       Biostrings_2.30.0
>>>            XVector_0.2.0                        IRanges_1.20.4
>>>>>>> [13] edgeR_3.4.0                          limma_3.18.2
>>>             biomaRt_2.18.0                       Biobase_2.22.0
>>>>>>> [17] genomeIntervals_1.18.0               BiocGenerics_0.8.0
>>>             intervals_0.14.0                     BiocInstaller_1.12.0
>>>>>>> 
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] annotate_1.40.0      AnnotationDbi_1.24.0 bitops_1.0-6
>>> DBI_0.2-7            genefilter_1.44.0    geneplotter_1.40.0   grid_3.0.2
>>>>>>> [8] hwriter_1.3          latticeExtra_0.6-26  LSD_2.5
>>>  RColorBrewer_1.0-5   RCurl_1.95-4.1       RSQLite_0.11.4
>>> splines_3.0.2
>>>>>>> [15] stats4_3.0.2         survival_2.37-4      tools_3.0.2
>>>  XML_3.98-1.1         xtable_1.7-1         zlibbioc_1.8.0
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Sent via the guest posting facility at bioconductor.org.
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> 
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 	[[alternative HTML version deleted]]
>> 
>> 
>> 
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> 
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793