[Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency
Ludwig.Geistlinger at bio.ifi.lmu.de
Wed Jun 10 10:29:32 CEST 2015
One follow-up question/comment on the EnsDb packages:
The reason they escaped my notice (and will thus potentially escape that of
others) is that I expected such packages to be named "^TxDb...".
What actually argues against sticking to existing Bioc vocabulary and
naming e.g. EnsDb.Hsapiens.v79
(or alternatively, if packages like BSgenome.Hsapiens.NCBI.GRCh38 will
indeed make it in the long run: TxDb.Hsapiens.Ensembl.GRCh38.ensGene)?
This would also have the advantage that genome build and idType could be
inferred right from the package name.
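
To illustrate what I mean by inferring things from the name, here is a minimal base R sketch. The helper parse_annotation_pkg and its field labels (class, organism, provider, build, table) are hypothetical and only reflect my reading of the TxDb naming scheme; nothing like this exists in Bioconductor as far as I know.

```r
# Hypothetical helper: split an annotation package name on "." and label
# the pieces. The field labels are an assumed reading of the TxDb scheme.
parse_annotation_pkg <- function(pkg) {
  parts <- strsplit(pkg, ".", fixed = TRUE)[[1]]
  fields <- c("class", "organism", "provider", "build", "table")
  length(parts) <- length(fields)  # pad missing fields with NA
  setNames(parts, fields)
}

# A TxDb-style name fills every field, including genome build and table.
parse_annotation_pkg("TxDb.Hsapiens.UCSC.hg38.knownGene")

# The EnsDb name only has three pieces, so build and table come back as NA
# and the release tag "v79" lands in the slot assumed to hold the provider.
parse_annotation_pkg("EnsDb.Hsapiens.v79")
```

With the alternative name suggested above, TxDb.Hsapiens.Ensembl.GRCh38.ensGene, the same split would fill all five fields.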
> dear Robert and Ludwig,
> the EnsDb packages provide all the gene/transcript etc annotations for all
> genes defined in the Ensembl database (for a given species and Ensembl
> release). Except for the "entrezid" column/attribute that is stored in the
> internal database, there is, however, no link to NCBI or UCSC annotations.
> So, basically, if you want to use "pure" Ensembl-based annotations: use
> EnsDb; if you want to have the UCSC annotations: use the TxDb packages.
> In case you need EnsDbs of other species or Ensembl versions, the
> ensembldb package provides functionality to generate such packages either
> using the Ensembl Perl API or using GTF files provided by Ensembl. If you
> have problems building the packages, just drop me a line and I'll do
> cheers, jo
>> On 03 Jun 2015, at 15:56, Robert M. Flight <rflight79 at gmail.com> wrote:
>> If you do this search on the UCSC genome browser (which this annotation
>> package is built from), you will see that the longest variant is what
>> If instead of "genes" you do "transcripts", you will see 20 different
>> transcripts for this gene, including the one listed by NCBI.
>> I haven't tried it yet (haven't upgraded R or Bioconductor to the latest
>> version), but there is now an Ensembl-based annotation package as well,
>> that may work better?
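
If I read the genes vs. transcripts point correctly, the gene range reported by the TxDb spans all transcripts assigned to the gene, so a single longer variant is enough to move the gene end. A toy base R sketch of that logic follows. The transcript names tx_a and tx_b are hypothetical, as is the assumption that both share the same start; only the start and the two end coordinates are taken from this thread.

```r
# Toy transcript table for one gene. tx_a / tx_b are made-up names;
# the coordinates reuse the start and the two end positions from the thread.
tx <- data.frame(
  tx_name = c("tx_a", "tx_b"),
  start   = c(43044295L, 43044295L),
  end     = c(43125483L, 43170403L),
  stringsAsFactors = FALSE
)

# Gene span = smallest start to largest end over all its transcripts.
gene_span <- c(start = min(tx$start), end = max(tx$end))
gene_span

# Dropping the longer variant yields the shorter end position instead.
short_span <- c(start = min(tx$start[tx$tx_name == "tx_a"]),
                end   = max(tx$end[tx$tx_name == "tx_a"]))
short_span
```

On the real TxDb object the analogous step would presumably be range(transcriptsBy(txdb, by="gene")[["672"]]), where txdb stands for the loaded TxDb.Hsapiens.UCSC.hg38.knownGene object; I have not run that here.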
>> On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
>> Ludwig.Geistlinger at bio.ifi.lmu.de> wrote:
>>> Dear Bioc annotation team,
>>> Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
>>> BRCA1; ENSG00000012048; entrez:672
>>>> genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))
>>> gives me:
>>> GRanges object with 1 range and 1 metadata column:
>>>       seqnames               ranges strand |     gene_id
>>>          <Rle>            <IRanges>  <Rle> | <character>
>>>   672    chr17 [43044295, 43170403]      - |         672
>>>   seqinfo: 455 sequences (1 circular) from hg38 genome
>>> However, according to Ensembl and NCBI Gene,
>>> the gene is located at (note the difference in the end position):
>>> Chromosome 17: 43,044,295-43,125,483 reverse strand
>>> How is the inconsistency explained, and how can I extract an
>>> Ensembl/NCBI-conformant annotation from the TxDb object?
>>> (I am aware of biomaRt, but I want to explicitly use the Bioc
>>> Dipl.-Bioinf. Ludwig Geistlinger
>>> Lehr- und Forschungseinheit für Bioinformatik
>>> Institut für Informatik
>>> Ludwig-Maximilians-Universität München
>>> Amalienstrasse 17, 2. Stock, Büro A201
>>> 80333 München
>>> Tel.: 089-2180-4067
>>> eMail: Ludwig.Geistlinger at bio.ifi.lmu.de
>>> Bioc-devel at r-project.org mailing list