[Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

Martin Morgan mtmorgan at fredhutch.org
Wed Jun 10 14:08:52 CEST 2015


On 06/10/2015 01:11 AM, Rainer Johannes wrote:
> Dear Martin,
>
> the AnnotationHub approach looks awesome! However, somehow it does not work for
> me, I always get an error:
>
> library(AnnotationHub)
> library(Rsamtools)
> library(GenomicFeatures)
> ah <- AnnotationHub()
>
> ## somehow I don't see DNA sequences for release-80... thus using 75
> query(ah, c("Takifugu", "release-75"))
>
> ## load the gtf and the dna.toplevel.fa
> gtf <- ah[["AH10717"]]
> dna <- ah[["AH20637"]]
>
> ## create txdb and get exons:
> txdb <- makeTxDbFromGRanges(gtf)
> exons <- exons(txdb)
>
> ## getting the sequences
> getSeq(dna, exons)
>
> and I get the following error:
> Error in value[[3L]](cond) : 'open' index failed
>    file: /Users/jo/~/.AnnotationHub/24732
> In addition: Warning message:
> In doTryCatch(return(expr), name, parentenv, handler) :
>    [fai_load] fail to open FASTA index.
>

Thanks Rainer --

The release-80 fasta files should have been made available to Bioc 3.1, but are 
only available in devel. This will be fixed later today.

At least some of the fasta files from earlier Ensembl releases seem to be 
missing an index, as you correctly diagnose. We'll look into this, too... I 
think the workaround is

   library(Rsamtools)
   indexFa(dna)
   dna = FaFile(path(dna))

Martin

> ## creating the index
> indexFa(dna)
>
> ## trying again:
> getSeq(dna, exons)
> Error in value[[3L]](cond) :  record 1 (MT:1-68) was truncated
>    file: /Users/jo/~/.AnnotationHub/24732
>
> This one is from the current R-devel, but I get the same error with the stable
> version... any idea?
>
> cheers, jo
>
>
> my session info:
>  > sessionInfo()
> R Under development (unstable) (2015-06-06 r68485)
> Platform: x86_64-apple-darwin14.4.0/x86_64 (64-bit)
> Running under: OS X 10.10.4 (Yosemite)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats4    stats     graphics  grDevices utils     datasets
> [8] methods   base
>
> other attached packages:
>   [1] GenomicFeatures_1.20.1 AnnotationDbi_1.30.1   Biobase_2.28.0
>   [4] Rsamtools_1.20.4       Biostrings_2.36.1      XVector_0.8.0
>   [7] GenomicRanges_1.20.4   GenomeInfoDb_1.4.0     IRanges_2.2.3
> [10] S4Vectors_0.6.0        BiocGenerics_0.14.0    AnnotationHub_2.0.2
>
> loaded via a namespace (and not attached):
>   [1] Rcpp_0.11.6                  magrittr_1.5
>   [3] GenomicAlignments_1.4.1      zlibbioc_1.14.0
>   [5] BiocParallel_1.2.2           xtable_1.7-4
>   [7] R6_2.0.1                     stringr_1.0.0
>   [9] httr_0.6.1                   tools_3.3.0
> [11] DBI_0.3.1                    lambda.r_1.1.7
> [13] futile.logger_1.4.1          htmltools_0.2.6
> [15] digest_0.6.8                 interactiveDisplayBase_1.6.0
> [17] shiny_0.12.0                 rtracklayer_1.28.4
> [19] futile.options_1.0.0         bitops_1.0-6
> [21] biomaRt_2.24.0               RCurl_1.95-4.6
> [23] RSQLite_1.0.0                mime_0.3
> [25] stringi_0.4-1                BiocInstaller_1.18.2
> [27] XML_3.98-1.2                 httpuv_1.3.2
>
>
>> On 09 Jun 2015, at 16:25, Martin Morgan <mtmorgan at fredhutch.org
>> <mailto:mtmorgan at fredhutch.org>> wrote:
>>
>> On 06/08/2015 11:43 PM, Rainer Johannes wrote:
>>> dear Robert and Ludwig,
>>>
>>> the EnsDb packages provide all the gene/transcript etc annotations for all
>>> genes defined in the Ensembl database (for a given species and Ensembl
>>> release). Except the column/attribute "entrezid" that is stored in the
>>> internal database there is however no link to NCBI or UCSC annotations. So,
>>> basically, if you want to use "pure" Ensembl based annotations: use EnsDb, if
>>> you want to have the UCSC annotations: use the TxDb packages.
>>>
>>> In case you need EnsDbs of other species or Ensembl versions, the ensembldb
>>> package provides functionality to generate such packages either using the
>>> Ensembl Perl API or using GTF files provided by Ensembl. If you have problems
>>> building the packages, just drop me a line and I'll do that.
>>
>> Two other sources of Ensembl TxDb's are GenomicFeatures::makeTxDbFromBiomart()
>> and AnnotationHub. For the latter, I'll add a variant of the following to the
>> AnnotationHub HOWTO
>> vignettehttp://bioconductor.org/packages/devel/bioc/html/AnnotationHub.htmllater
>> today.
>>
>> ## Gene models
>>
>> _Bioconductor_ represents gene models using 'transcript'
>> databases. These are available via packages such as
>> [TxDb.Hsapiens.UCSC.hg38.knownGene](http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.knownGene.html),
>> or can be constructed using functions such as
>> `[GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures.html)::makeTxDbFromBiomart()`
>> or `GenomicFeatures::makeTxDbFromGRanges()`.
>>
>> _AnnotationHub_ provides an easy way to work with gene models
>> published by Ensembl. Here we discover the Ensemble release 80 r
>> esources for pufferfish,_Takifugu rubripes_
>>
>> ```{r takifugu-gene-models}
>> query(ah, c("Takifugu", "release-80"))
>> ```
>>
>> We see that there is a GTF file, as well as various DNA
>> sequences. Let's retrieve the GTF and top-level sequence files. The
>> GTF file is imported as a _GRanges_ instance, the DNA sequence as a
>> compressed, indexed Fasta file
>>
>>
>> ```{r takifugi-data}
>> gtf <- ah[["AH47101"]]
>> dna <- ah[["AH47477"]]
>>
>> head(gtf, 3)
>> dna
>> head(seqlevels(dna))
>> ```
>>
>> It is trivial to make a TxDb instance
>>
>> ```{r takifugi-txdb}
>> library(GenomicFeatures)
>> txdb <- makeTxDbFromGRanges(gtf)
>> ````
>>
>> and to use that in conjunction with the DNA sequence, e.g., to find
>> exon sequences of all annotated genes.
>>
>> ```{r takifugi-exons}
>> library(Rsamtools)     # for getSeq,FaFile-method
>> exons <- exons(txdb)
>> getSeq(dna, exons)
>> ```
>>
>> Some difficulties arise when working with this partly assembled genome
>> that require more advanced GenomicRanges skills, see the
>> [GenomicRanges](http://bioconductor.org/packages/GenomicRanges.html)
>> vignettes, especially "GenomicRanges HOWTOs" and "An Introduction to
>> GenomicRanges".
>>
>>
>>>
>>> cheers, jo
>>>
>>>> On 03 Jun 2015, at 15:56, Robert M. Flight <rflight79 at gmail.com
>>>> <mailto:rflight79 at gmail.com>> wrote:
>>>>
>>>> Ludwig,
>>>>
>>>> If you do this search on the UCSC genome browser (which this annotation
>>>> package is built from), you will see that the longest variant is what is
>>>> shown
>>>>
>>>> http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885
>>>>
>>>>
>>>>
>> If instead of "genes" you do "transcripts", you will see 20 different
>>>> transcripts for this gene, including the one listed by NCBI.
>>>>
>>>> I havent tried it yet (haven't upgraded R or bioconductor to latest
>>>> version), but there is now an Ensembl based annotation package as well,
>>>> that may work better??
>>>> http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html
>>>>
>>>>
>>>>
>> -Robert
>>>>
>>>>
>>>>
>>>> On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
>>>> Ludwig.Geistlinger at bio.ifi.lmu.de
>>>> <mailto:Ludwig.Geistlinger at bio.ifi.lmu.de>> wrote:
>>>>
>>>>> Dear Bioc annotation team,
>>>>>
>>>>> Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
>>>>> for
>>>>>
>>>>> BRCA1; ENSG00000012048; entrez:672
>>>>>
>>>>> via
>>>>>
>>>>>> genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))
>>>>>
>>>>> gives me:
>>>>>
>>>>> GRanges object with 1 range and 1 metadata column: seqnames
>>>>> ranges strand |     gene_id <Rle>            <IRanges>  <Rle> |
>>>>> <character> 672    chr17 [43044295, 43170403]      - |         672
>>>>> ------- seqinfo: 455 sequences (1 circular) from hg38 genome
>>>>>
>>>>>
>>>>> However, querying Ensembl and NCBI Gene
>>>>> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
>>>>>
>>>>>
>> http://www.ncbi.nlm.nih.gov/gene/672
>>>>>
>>>>> the gene is located at (note the difference in the end position)
>>>>>
>>>>> Chromosome 17: 43,044,295-43,125,483 reverse strand
>>>>>
>>>>>
>>>>> How is the inconsistency explained and how to extract an ENSEMBL/NCBI
>>>>> conform annotation from the TxDb object? (I am aware of biomaRt, but I
>>>>> want to explicitely use the Bioc annotation functionality).
>>>>>
>>>>> Thanks! Ludwig
>>>>>
>>>>>
>>>>> -- Dipl.-Bioinf. Ludwig Geistlinger
>>>>>
>>>>> Lehr- und Forschungseinheit für Bioinformatik Institut für Informatik
>>>>> Ludwig-Maximilians-Universität München Amalienstrasse 17, 2. Stock, Büro
>>>>> A201 80333 München
>>>>>
>>>>> Tel.: 089-2180-4067 eMail: Ludwig.Geistlinger at bio.ifi.lmu.de
>>>>> <mailto:Ludwig.Geistlinger at bio.ifi.lmu.de>
>>>>>
>>>>> _______________________________________________ Bioc-devel at r-project.org
>>>>> <mailto:Bioc-devel at r-project.org>
>>>>> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________ Bioc-devel at r-project.org
>>>> <mailto:Bioc-devel at r-project.org>
>>>> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>> _______________________________________________Bioc-devel at r-project.org
>>> <mailto:Bioc-devel at r-project.org>
>>> mailing listhttps://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list