[Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

Wed Jun 10 10:29:32 CEST 2015

Dear Johannes,

One follow-up question/comment on the EnsDb packages:

The reason they escaped my notice (and may well escape that of others)
is that I expected such packages to be named "^TxDb...".

What actually argues against sticking to the existing Bioc vocabulary and
naming, e.g., EnsDb.Hsapiens.v79 as

TxDb.Hsapiens.Ensembl.hg38.ensGene

(or alternatively, if packages like BSgenome.Hsapiens.NCBI.GRCh38 do
indeed become established in the long run:
TxDb.Hsapiens.Ensembl.GRCh38.ensGene)?

This would also have the advantage that genome build and idType could be
inferred right from the package name.
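
To illustrate what I mean, here is a minimal sketch in base R. The helper
parse_annotation_name() is hypothetical (it is not part of any Bioconductor
package), and it assumes the five-field layout
<class>.<organism>.<provider>.<build>.<track> used by names such as
TxDb.Hsapiens.UCSC.hg38.knownGene. The field labels are my own choice.

```r
# Hypothetical helper: split a TxDb-style package name into its fields.
# Assumes exactly five dot-separated fields; the labels are illustrative.
parse_annotation_name <- function(pkg) {
  parts <- strsplit(pkg, ".", fixed = TRUE)[[1]]
  if (length(parts) != 5L) {
    stop("expected 5 dot-separated fields, got ", length(parts), ": ", pkg)
  }
  setNames(as.list(parts),
           c("class", "organism", "provider", "build", "id_type"))
}

info <- parse_annotation_name("TxDb.Hsapiens.UCSC.hg38.knownGene")
info$build    # "hg38"
info$id_type  # "knownGene"
```

A name like EnsDb.Hsapiens.v79 has only three fields, so the same call
stops with an error: the genome build and ID type cannot be read off it.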

Best,
Ludwig

> dear Robert and Ludwig,
>
> the EnsDb packages provide all the gene/transcript etc. annotations for all
> genes defined in the Ensembl database (for a given species and Ensembl
> release). Except for the "entrezid" column/attribute stored in the
> internal database, there is, however, no link to NCBI or UCSC annotations.
> So, basically, if you want to use "pure" Ensembl-based annotations: use
> EnsDb; if you want to have the UCSC annotations: use the TxDb packages.
>
> In case you need EnsDbs of other species or Ensembl versions, the
> ensembldb package provides functionality to generate such packages either
> using the Ensembl Perl API or using GTF files provided by Ensembl. If you
> have problems building the packages, just drop me a line and I'll do
> that.
>
> cheers, jo
>
>> On 03 Jun 2015, at 15:56, Robert M. Flight <rflight79 at gmail.com> wrote:
>>
>> Ludwig,
>>
>> If you do this search on the UCSC genome browser (which this annotation
>> package is built from), you will see that the longest variant is what
>> is shown
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885
>>
>> If instead of "genes" you do "transcripts", you will see 20 different
>> transcripts for this gene, including the one listed by NCBI.
>>
>> I haven't tried it yet (I haven't upgraded R or Bioconductor to the latest
>> version), but there is now an Ensembl-based annotation package as well,
>> which may work better:
>> http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html
>>
>> -Robert
>>
>>
>>
>> On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
>> Ludwig.Geistlinger at bio.ifi.lmu.de> wrote:
>>
>>> Dear Bioc annotation team,
>>>
>>> Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for
>>>
>>> BRCA1; ENSG00000012048; entrez:672
>>>
>>> via
>>>
>>>> genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))
>>>
>>> gives me:
>>>
>>> GRanges object with 1 range and 1 metadata column:
>>>      seqnames               ranges strand |     gene_id
>>>         <Rle>            <IRanges>  <Rle> | <character>
>>>  672    chr17 [43044295, 43170403]      - |         672
>>>  -------
>>>  seqinfo: 455 sequences (1 circular) from hg38 genome
>>>
>>>
>>> However, querying Ensembl and NCBI Gene
>>> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
>>> http://www.ncbi.nlm.nih.gov/gene/672
>>>
>>> the gene is located at (note the difference in the end position)
>>>
>>> Chromosome 17: 43,044,295-43,125,483 reverse strand
>>>
>>>
>>> How is the inconsistency explained, and how can I extract an
>>> Ensembl/NCBI-conformant annotation from the TxDb object?
>>> (I am aware of biomaRt, but I want to explicitly use the Bioc
>>> annotation functionality.)
>>>
>>> Thanks!
>>> Ludwig
>>>
>>>
>>> --
>>> Dipl.-Bioinf. Ludwig Geistlinger
>>>
>>> Lehr- und Forschungseinheit für Bioinformatik
>>> Institut für Informatik
>>> Ludwig-Maximilians-Universität München
>>> Amalienstrasse 17, 2. Stock, Büro A201
>>> 80333 München
>>>
>>> Tel.: 089-2180-4067
>>> eMail: Ludwig.Geistlinger at bio.ifi.lmu.de
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>