[Bioc-devel] [BioC] GTF file error when using easyRNAseq
Nicolas Delhomme
delhomme at embl.de
Sat Nov 16 23:34:10 CET 2013
Hej Martin!
That is indeed extremely useful. I’ll add that to my vignette as a source for annotation. One thing that comes immediately to mind is to have AnnotationHub and GenomicFeatures interact. I could imagine it would be useful to have e.g. a makeTranscriptDbFromAnnotationHub function. I admit I haven’t though about this in detail, but I already can think of use cases where such a function would come in handy.
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On 15 Nov 2013, at 20:13, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 11/15/2013 10:22 AM, Michael Lawrence wrote:
>> Doesn't look like genomeIntervals has any C code (?), so a performance
>> comparison would be interesting. rtracklayer jumps through all sorts of
>> hoops to handle obscure things like URL encoding in GFF3. The code in
>> genomeIntervals seems more streamlined.
>
> Wanted to mention, and it would be good to know if this was not helpful at all, that the Ensembl gtf files are available through AnnotationHub as GRanges objects
>
> > library(AnnotationHub)
> > hub = AnnotationHub()
> > hub$ensembl.release.73.<tab>
> hub$ensembl.release.73.fasta. ... [378]
> hub$ensembl.release.73.gtf. ... [63]
> > xx = hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData
> > xx
> GRanges with 381368 ranges and 12 metadata columns:
> seqnames ranges strand | source type
> <Rle> <IRanges> <Rle> | <factor> <factor>
> [1] 1 [1735, 2449] + | protein_coding exon
> [2] 1 [2379, 2449] + | protein_coding CDS
> score phase gene_id transcript_id
> <numeric> <integer> <character> <character>
> [1] <NA> <NA> ENSGALG00000009771 ENSGALT00000015891
> [2] <NA> 0 ENSGALG00000009771 ENSGALT00000015891
> exon_number gene_biotype exon_id protein_id
> <numeric> <character> <character> <character>
> [1] 1 protein_coding ENSGALE00000301221 <NA>
> [2] 1 protein_coding <NA> ENSGALP00000015874
> gene_name transcript_name
> <character> <character>
> [1] <NA> <NA>
> [2] <NA> <NA>
> [ reached getOption("max.print") -- omitted 9 rows ]
> ---
> seqlengths:
> 1 2 ... AADN03010940.1
> NA NA ... NA
>
> Martin
>
>>
>>
>>
>>
>>
>>
>> On Fri, Nov 15, 2013 at 10:14 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
>>
>>> Took that thread to the devel list, just feels more appropriate with
>>> regards to the content.
>>>
>>> I already have that on my TODO list :-). This is not up-to-date, i.e. I
>>> haven’t done the comparison in ~2 years, but last time I did,
>>> genomeIntervals attribute parsing was faster than rtracklayer equivalent. I
>>> suppose that’s because it is already implemented in C in genomeIntervals.
>>> As said I don’t have any actual comparative numbers, still you might want
>>> to have a look at the genomeIntervals code. As I don’t think that
>>> genomeIntervals get as much exposition as rtracklayer does, many more
>>> people would benefit from an equivalent rtracklayer implementation. If
>>> you’re interested, I could do a performance comparison - based on my usual
>>> use case - between both packages.
>>>
>>> Nico
>>>
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>>
>>> Genome Biology Computational Support
>>>
>>> European Molecular Biology Laboratory
>>>
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>>
>>>
>>>
>>>
>>>
>>> On 15 Nov 2013, at 18:58, Michael Lawrence <lawrence.michael at gene.com>
>>> wrote:
>>>
>>>> It might be worth taking a look at rtracklayer and the TranscriptDb
>>> stuff in GenomicFeatures. It could save you time, and if you notice any
>>> deficiencies in rtracklayer, it would help me. For example, if the
>>> attribute parsing is a bottleneck, I can push it down to C.
>>>>
>>>> Michael
>>>>
>>>> On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delhomme at embl.de>
>>> wrote:
>>>> Hej Michael,
>>>>
>>>> Good question really. I have a number of reason for this:
>>>>
>>>> 1) I’ve been using the genomeIntervals readGff3 function for that - for
>>> years now - and I’ve always been satisfied by its performance, especially
>>> when parsing the gff/gtf ninth column. The parseGffAttribute and
>>> getGffAttribute functions are extremely convenient. I honestly haven’t
>>> checked if there was any recent development in rtracklayer /
>>> GenomicFeatures similar to these functions. If there were not, I think they
>>> would be a great addition to either package.
>>>>
>>>> 2) As you might guess it’s essentially historical, back when I started
>>> that package in 2009, there was not today’s fantastic set of packages.
>>>>
>>>> 3) As you painfully know, there’s about as many gff format as they are
>>> gff files, and because my package is a pipeline I really want to make sure
>>> that it’s output is consistent, hence I have strict requirement with
>>> regards to the gff/gtf format I accept. Which means that times and again, I
>>> have to do slight adjustment but I prefer that over outputting garbage.
>>>>
>>>> 4) RNA-Seq analyses are filled with pitfalls, hence I think it is
>>> essential that users understand the data formats they handle and actually
>>> what these analyses are all about. I don’t want them to use my package as
>>> they would use a black box.
>>>>
>>>> 5) It’s educational. There’s a vignette that describes how to parse and
>>> convert gff/gtf annotation in the minimal gff/gtf formatted file that would
>>> suit my package
>>>>
>>>> Well, I suppose it’s more than you asked for, but here are my reasons
>>> ;-) You’re welcome to comment and I’d be happy to look again at rtracklayer
>>> (been through GenomicFeatures recently and I like it much) if you would
>>> advise me so.
>>>>
>>>> Have a nice WE,
>>>>
>>>> Cheers,
>>>>
>>>> Nico
>>>>
>>>>
>>>> ---------------------------------------------------------------
>>>> Nicolas Delhomme
>>>>
>>>> Genome Biology Computational Support
>>>>
>>>> European Molecular Biology Laboratory
>>>>
>>>> Tel: +49 6221 387 8310
>>>> Email: nicolas.delhomme at embl.de
>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>> 69102 Heidelberg, Germany
>>>> ---------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.michael at gene.com>
>>> wrote:
>>>>
>>>>> Why not use rtracklayer / GenomicFeatures for parsing GTF? That format
>>> is tough; no reason for everyone to take it on by themselves.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delhomme at embl.de>
>>> wrote:
>>>>> Hej Natalia!
>>>>>
>>>>> There were a number of lines in that particular gtf that violated the
>>> assumptions I had about EnsEMBL gtf. Not all the fields in the attributes'
>>> column were always set and one of the gene name had a space character in
>>> it. I’ve made the parsing of gtf file annotation more flexible/lenient and
>>> that should resolve that particular issue you had. The changes should
>>> propagate in ~2 days to Bioc with easyRNASeq version 1.8.2.
>>>>>
>>>>> Rather than using the geneModel, which implementation is old and has
>>> gotten slow because of changes in the underlying architecture, I prefer an
>>> approach where I
>>>>> 1) filter the gtf / gff annotation file for only those lines I’m
>>> interested in (e.g. of type exon, mRNA and gene for a gff file)
>>>>> 2) collapse every exon of a gene into what I call now a “synthetic
>>> transcript”. The reason for changing the naming from geneModel to synthetic
>>> transcript is that “gene model” has different meaning depending on the
>>> field of application.
>>>>> 3) use easyRNASeq with the modified annotation and the “transcript”
>>> count method.
>>>>>
>>>>> I’ve detailed that procedure in the vignette section 7.1 .
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Nico
>>>>>
>>>>> ---------------------------------------------------------------
>>>>> Nicolas Delhomme
>>>>>
>>>>> Genome Biology Computational Support
>>>>>
>>>>> European Molecular Biology Laboratory
>>>>>
>>>>> Tel: +49 6221 387 8310
>>>>> Email: nicolas.delhomme at embl.de
>>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>>> 69102 Heidelberg, Germany
>>>>> ---------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 12 Nov 2013, at 14:21, Nicolas Delhomme <delhomme at embl.de> wrote:
>>>>>
>>>>>> Hej Natalia!
>>>>>>
>>>>>> This is not the first time that I’ve seen this error on the list,
>>> but I’ve not been able to reproduce it so far with my own data. Would you
>>> mind sharing some data offline, just an excerpt of your files would do. If
>>> that’s OK, I can create and give access to a folder on my box account.
>>>>>>
>>>>>> I had already relaxed the constraint on parsing a gtf file in a
>>> previous update but forgot to reflect the changes in the error message.
>>> Only the gene_id and transcript_id are actually mandatory. I would not
>>> expect any issue with the EnsEMBL gtf file, but I’ll have a look at why it
>>> fails for Gallus galls one and let you know asap.
>>>>>>
>>>>>> This as nothing to do with this error, but by looking at your
>>> command line, I saw that you provide a character vector to the chrSize
>>> argument. This is not necessary as this information is extracted from the
>>> bam file in your case, so you can just drop the chrSizes = “chrSizes” from
>>> your command line. I’ve added some extra check in the method to detect this
>>> now. Thanks :-)
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Nico
>>>>>>
>>>>>> ---------------------------------------------------------------
>>>>>> Nicolas Delhomme
>>>>>>
>>>>>> Genome Biology Computational Support
>>>>>>
>>>>>> European Molecular Biology Laboratory
>>>>>>
>>>>>> Tel: +49 6221 387 8310
>>>>>> Email: nicolas.delhomme at embl.de
>>>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>>>> 69102 Heidelberg, Germany
>>>>>> ---------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11 Nov 2013, at 16:31, Natalia [guest] <guest at bioconductor.org>
>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Dear all,
>>>>>>> I would like to make a count table to use it in DESeq. I´ve tried
>>> to use easyRNAseq but I have a problem with the annotation file. I’ve
>>> downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run
>>> into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot
>>> is empty, I modified the file and added chr before the chromosome number.
>>> The next problem was this:
>>>>>>>
>>>>>>> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all
>>> the required fields: gene_id, transcript_id, exon_number, gene_name.
>>>>>>>
>>>>>>> To solve this problem:
>>>>>>> - I deleted all the entries without gene_name (first example):
>>>>>>>
>>>>>>> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891";
>>> exon_number "1"; gene_biotype "protein_coding"; exon_id
>>> "ENSGALE00000301221";
>>>>>>>
>>>>>>> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914";
>>> exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding";
>>> transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891";
>>>>>>>
>>>>>>> - I checked the chromosome numbers and deleted the entries that
>>> didn’t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can’t
>>> find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf
>>> file, I don’t know if it is a problem):
>>>>>>>
>>>>>>> - I searched for semicolons and single quotes ‘ in the gene names,
>>> but I didn’t find any on the final file.
>>>>>>>
>>>>>>> - I deleted all the columns after gene_name.
>>>>>>>
>>>>>>> So finally the annotation file entries look like this:
>>>>>>> chr1 protein_coding exon 19962541 19963992
>>>>>>> . + . gene_id "ENSGALG00000000003";
>>> transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2";
>>>>>>>
>>>>>>> Nothing works; the error message is always the same. So, I don’t
>>> know what else I can do. Could you please help me?
>>>>>>> Thank you in advance!
>>>>>>> Cheers
>>>>>>>
>>>>>>> Natalia
>>>>>>>
>>>>>>>
>>>>>>> here is my code:
>>>>>>>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus",
>>> chrSizes="chrSizes", annotationMethod="gtf",
>>> annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes",
>>> summarization="geneModels", format="bam", gapped=TRUE,
>>> filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq",
>>> conditions=conditions)
>>>>>>> Checking arguments...
>>>>>>> Fetching annotations...
>>>>>>> Read 334620 records
>>>>>>> Error en .getGtfRange(organismName(obj), filename = filename,
>>> ignoreWarnings = ignoreWarnings, :
>>>>>>> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all
>>> the required fields: gene_id, transcript_id, exon_number, gene_name.
>>>>>>> Además: Mensajes de aviso perdidos
>>>>>>> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =
>>> "chrSizes", :
>>>>>>> Your organism has no mapping defined to perform the validity check
>>> for the UCSC compliance of the chromosome name.
>>>>>>> Defined organism's mapping can be listed using the 'knownOrganisms'
>>> function.
>>>>>>> To benefit from the validity check, you can provide a 'chr.map' to
>>> your 'easyRNASeq' function call.
>>>>>>> As you did not do so, 'validity.check' is turned off
>>>>>>> 2: In .Method(..., deparse.level = deparse.level) :
>>>>>>> number of columns of result is not a multiple of vector length (arg
>>> 1)
>>>>>>>
>>>>>>>> traceback()
>>>>>>> 6: stop(paste("Your gtf file: ", filename, " does not contain all
>>> the required fields: ",
>>>>>>> paste(fields, collapse = ", "), ".", sep = ""))
>>>>>>> 5: .getGtfRange(organismName(obj), filename = filename,
>>> ignoreWarnings = ignoreWarnings,
>>>>>>> ...)
>>>>>>> 4: fetchAnnotation(obj, method = annotationMethod, filename =
>>> annotationFile,
>>>>>>> ignoreWarnings = ignoreWarnings, ...)
>>>>>>> 3: fetchAnnotation(obj, method = annotationMethod, filename =
>>> annotationFile,
>>>>>>> ignoreWarnings = ignoreWarnings, ...)
>>>>>>> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =
>>> "chrSizes",
>>>>>>> annotationMethod = "gtf", annotationFile = "
>>> Gallus_gallus.Galgal4.73.gtf ",
>>>>>>> count = "genes", summarization = "geneModels", format = "bam",
>>>>>>> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"),
>>>>>>> outputFormat = "DESeq", conditions = conditions)
>>>>>>> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =
>>> "chrSizes",
>>>>>>> annotationMethod = "gtf", annotationFile = "
>>> Gallus_gallus.Galgal4.73.gtf ",
>>>>>>> count = "genes", summarization = "geneModels", format = "bam",
>>>>>>> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"),
>>>>>>> outputFormat = "DESeq", conditions = conditions)
>>>>>>>
>>>>>>>
>>>>>>> -- output of sessionInfo():
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 3.0.2 (2013-09-25)
>>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
>>> LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
>>> LC_TIME=Spanish_Spain.1252
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] parallel stats graphics grDevices utils datasets
>>> methods base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0
>>> easyRNASeq_1.8.1 ShortRead_1.20.0
>>>>>>> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3
>>> DESeq_1.14.0 lattice_0.20-23
>>>>>>> [9] locfit_1.5-9.1 Biostrings_2.30.0
>>> XVector_0.2.0 IRanges_1.20.4
>>>>>>> [13] edgeR_3.4.0 limma_3.18.2
>>> biomaRt_2.18.0 Biobase_2.22.0
>>>>>>> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0
>>> intervals_0.14.0 BiocInstaller_1.12.0
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6
>>> DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2
>>>>>>> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5
>>> RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4
>>> splines_3.0.2
>>>>>>> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2
>>> XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sent via the guest posting facility at bioconductor.org.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>>
>>>
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
More information about the Bioc-devel
mailing list