[Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

Tue Jun 9 16:25:49 CEST 2015

On 06/08/2015 11:43 PM, Rainer Johannes wrote:
> dear Robert and Ludwig,
>
> the EnsDb packages provide all the gene/transcript etc annotations for all
> genes defined in the Ensembl database (for a given species and Ensembl
> release). Except the column/attribute "entrezid" that is stored in the
> internal database there is however no link to NCBI or UCSC annotations. So,
> basically, if you want to use "pure" Ensembl based annotations: use EnsDb, if
> you want to have the UCSC annotations: use the TxDb packages.
>
> In case you need EnsDbs of other species or Ensembl versions, the ensembldb
> package provides functionality to generate such packages either using the
> Ensembl Perl API or using GTF files provided by Ensembl. If you have problems
> building the packages, just drop me a line and I'll do that.

Two other sources of Ensembl TxDb's are GenomicFeatures::makeTxDbFromBiomart() 
and AnnotationHub. For the latter, I'll add a variant of the following to the 
AnnotationHub HOWTO vignette 
http://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html later today.

## Gene models

_Bioconductor_ represents gene models using 'transcript'
databases. These are available via packages such as
[TxDb.Hsapiens.UCSC.hg38.knownGene](http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.knownGene.html),
or can be constructed using functions such as
`[GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures.html)::makeTxDbFromBiomart()` 
or `GenomicFeatures::makeTxDbFromGRanges()`.

_AnnotationHub_ provides an easy way to work with gene models
published by Ensembl. Here we discover the Ensemble release 80 r
esources for pufferfish,_Takifugu rubripes_

```{r takifugu-gene-models}
query(ah, c("Takifugu", "release-80"))
```

We see that there is a GTF file, as well as various DNA
sequences. Let's retrieve the GTF and top-level sequence files. The
GTF file is imported as a _GRanges_ instance, the DNA sequence as a
compressed, indexed Fasta file

```{r takifugi-data}
gtf <- ah[["AH47101"]]
dna <- ah[["AH47477"]]

head(gtf, 3)
dna
head(seqlevels(dna))
```

It is trivial to make a TxDb instance

```{r takifugi-txdb}
library(GenomicFeatures)
txdb <- makeTxDbFromGRanges(gtf)
````

and to use that in conjunction with the DNA sequence, e.g., to find
exon sequences of all annotated genes.

```{r takifugi-exons}
library(Rsamtools)     # for getSeq,FaFile-method
exons <- exons(txdb)
getSeq(dna, exons)
```

Some difficulties arise when working with this partly assembled genome
that require more advanced GenomicRanges skills, see the
[GenomicRanges](http://bioconductor.org/packages/GenomicRanges.html)
vignettes, especially "GenomicRanges HOWTOs" and "An Introduction to
GenomicRanges".

>
> cheers, jo
>
>> On 03 Jun 2015, at 15:56, Robert M. Flight <rflight79 at gmail.com> wrote:
>>
>> Ludwig,
>>
>> If you do this search on the UCSC genome browser (which this annotation
>> package is built from), you will see that the longest variant is what is
>> shown
>>
>> http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885
>>
>>
>>
If instead of "genes" you do "transcripts", you will see 20 different
>> transcripts for this gene, including the one listed by NCBI.
>>
>> I havent tried it yet (haven't upgraded R or bioconductor to latest
>> version), but there is now an Ensembl based annotation package as well,
>> that may work better??
>> http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html
>>
>>
>>
-Robert
>>
>>
>>
>> On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
>> Ludwig.Geistlinger at bio.ifi.lmu.de> wrote:
>>
>>> Dear Bioc annotation team,
>>>
>>> Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
>>> for
>>>
>>> BRCA1; ENSG00000012048; entrez:672
>>>
>>> via
>>>
>>>> genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))
>>>
>>> gives me:
>>>
>>> GRanges object with 1 range and 1 metadata column: seqnames
>>> ranges strand |     gene_id <Rle>            <IRanges>  <Rle> |
>>> <character> 672    chr17 [43044295, 43170403]      - |         672
>>> ------- seqinfo: 455 sequences (1 circular) from hg38 genome
>>>
>>>
>>> However, querying Ensembl and NCBI Gene
>>> http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
>>>
>>>
http://www.ncbi.nlm.nih.gov/gene/672
>>>
>>> the gene is located at (note the difference in the end position)
>>>
>>> Chromosome 17: 43,044,295-43,125,483 reverse strand
>>>
>>>
>>> How is the inconsistency explained and how to extract an ENSEMBL/NCBI
>>> conform annotation from the TxDb object? (I am aware of biomaRt, but I
>>> want to explicitely use the Bioc annotation functionality).
>>>
>>> Thanks! Ludwig
>>>
>>>
>>> -- Dipl.-Bioinf. Ludwig Geistlinger
>>>
>>> Lehr- und Forschungseinheit für Bioinformatik Institut für Informatik
>>> Ludwig-Maximilians-Universität München Amalienstrasse 17, 2. Stock, Büro
>>> A201 80333 München
>>>
>>> Tel.: 089-2180-4067 eMail: Ludwig.Geistlinger at bio.ifi.lmu.de
>>>
>>> _______________________________________________ Bioc-devel at r-project.org
>>> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________ Bioc-devel at r-project.org
>> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> _______________________________________________ Bioc-devel at r-project.org
> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793