[Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

Rainer Johannes Johannes.Rainer at eurac.edu
Tue Jun 9 08:43:36 CEST 2015


Dear Robert and Ludwig,

The EnsDb packages provide the gene, transcript, and related annotations for all genes defined in the Ensembl database (for a given species and Ensembl release). Apart from the "entrezid" column/attribute stored in the internal database, however, there is no link to NCBI or UCSC annotations.
So, basically: if you want "pure" Ensembl-based annotations, use EnsDb; if you want the UCSC annotations, use the TxDb packages.
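
As a rough sketch of what the EnsDb route looks like for Ludwig's BRCA1 example (this assumes the EnsDb.Hsapiens.v79 package that Robert linked is installed; the filter class is called GeneIdFilter in recent ensembldb releases, while older releases spelled it GeneidFilter, so adjust to your version):

```r
# Sketch only: assumes the EnsDb.Hsapiens.v79 annotation package is installed.
library(EnsDb.Hsapiens.v79)  # also attaches ensembldb

edb <- EnsDb.Hsapiens.v79

# Gene-level range as defined by Ensembl for BRCA1 (ENSG00000012048).
# "entrezid" is requested explicitly because it is the only column that
# links back to NCBI identifiers.
brca1 <- genes(
  edb,
  filter = GeneIdFilter("ENSG00000012048"),
  columns = c("gene_id", "gene_name", "entrezid")
)
brca1

# All Ensembl transcripts annotated for the same gene.
brca1_tx <- transcripts(edb, filter = GeneIdFilter("ENSG00000012048"))
brca1_tx
```

The gene range returned here comes from the Ensembl release the package was built for, so it can be compared directly against the range reported by the TxDb package.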

In case you need EnsDbs for other species or Ensembl versions, the ensembldb package provides functionality to generate such packages, either via the Ensembl Perl API or from GTF files provided by Ensembl. If you have problems building the packages, just drop me a line and I'll build them for you.
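
A minimal sketch of the GTF route, wrapped in a helper function. The function name, the GTF file name, and the maintainer/author strings are placeholders for illustration only; check ?ensDbFromGtf and ?makeEnsembldbPackage in your installed ensembldb version for the exact arguments:

```r
# Sketch only: builds an EnsDb SQLite file and a package skeleton from an
# Ensembl GTF file. build_ensdb_from_gtf is a hypothetical helper name.
build_ensdb_from_gtf <- function(gtf_file, maintainer, author, dest_dir = ".") {
  stopifnot(file.exists(gtf_file))

  # Parse the GTF and write the annotation to an SQLite file. Organism,
  # genome build, and Ensembl version should be guessed from the file name
  # when it follows Ensembl's naming scheme; otherwise pass them explicitly.
  db_file <- ensembldb::ensDbFromGtf(gtf = gtf_file)

  # Wrap the SQLite file in an installable annotation package.
  ensembldb::makeEnsembldbPackage(
    ensdb = db_file,
    version = "0.0.1",
    maintainer = maintainer,
    author = author,
    destDir = dest_dir
  )

  # Return the path to the SQLite file so it can also be loaded directly.
  db_file
}
```

It would be called along the lines of build_ensdb_from_gtf("Homo_sapiens.GRCh38.79.gtf.gz", "Jane Doe <jane@example.org>", "Jane Doe") (file name and person are made up), after which the generated package directory can be built and installed with R CMD build and R CMD INSTALL.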

cheers, jo

> On 03 Jun 2015, at 15:56, Robert M. Flight <rflight79 at gmail.com> wrote:
> 
> Ludwig,
> 
> If you do this search on the UCSC genome browser (which this annotation
> package is built from), you will see that the longest variant is what is
> shown:
> 
> http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885
> 
> If instead of "genes" you do "transcripts", you will see 20 different
> transcripts for this gene, including the one listed by NCBI.
> 
> I haven't tried it yet (I haven't upgraded R or Bioconductor to the latest
> version), but there is now an Ensembl-based annotation package as well
> that may work better?
> http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html
> 
> -Robert
> 
> 
> 
> On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
> Ludwig.Geistlinger at bio.ifi.lmu.de> wrote:
> 
>> Dear Bioc annotation team,
>> 
>> Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for
>> 
>> BRCA1; ENSG00000012048; entrez:672
>> 
>> via
>> 
>>> genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))
>> 
>> gives me:
>> 
>> GRanges object with 1 range and 1 metadata column:
>>      seqnames               ranges strand |     gene_id
>>         <Rle>            <IRanges>  <Rle> | <character>
>>  672    chr17 [43044295, 43170403]      - |         672
>>  -------
>>  seqinfo: 455 sequences (1 circular) from hg38 genome
>> 
>> 
>> However, querying Ensembl and NCBI Gene
>> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
>> http://www.ncbi.nlm.nih.gov/gene/672
>> 
>> the gene is located at (note the difference in the end position)
>> 
>> Chromosome 17: 43,044,295-43,125,483 reverse strand
>> 
>> 
>> How is the inconsistency explained, and how can I extract an
>> Ensembl/NCBI-conformant annotation from the TxDb object?
>> (I am aware of biomaRt, but I want to explicitly use the Bioc annotation
>> functionality.)
>> 
>> Thanks!
>> Ludwig
>> 
>> 
>> --
>> Dipl.-Bioinf. Ludwig Geistlinger
>> 
>> Lehr- und Forschungseinheit für Bioinformatik
>> Institut für Informatik
>> Ludwig-Maximilians-Universität München
>> Amalienstrasse 17, 2. Stock, Büro A201
>> 80333 München
>> 
>> Tel.: 089-2180-4067
>> eMail: Ludwig.Geistlinger at bio.ifi.lmu.de
>> 
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> 