[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

Tue Jan 12 07:53:47 CET 2016

Hi Vince, Robert,

On 01/11/2016 07:07 AM, Vincent Carey wrote:
> On Mon, Jan 11, 2016 at 9:29 AM, Robert Castelo <robert.castelo at upf.edu>
> wrote:
>
>> hi,
>>
>> if i'm interpreting this correctly, the news archive of the UCSC Genome
>> Browser accessible here:
>>
>>   http://genome.ucsc.edu/goldenPath/newsarch.html
>>
>> announced on June 29th, 2015, that they are discontinuing the generation
>> of UCSC Known Genes annotations for human, and provide the Gencode
>> annotations as default replacement.
>>
>> the BioC site provides as default gene annotations for human the UCSC
>> Known Genes track and currently does not provide the Gencode annotations.
>>
>> the GenomicFeatures package allows one to build such an annotation
>> package. unfortunately the current "supported" UCSC tables that can be
>> easily used via 'makeTxDbPackageFromUCSC()' reports up to Gencode version
>> V17:
>>
>> library(GenomicFeatures)
>>
>> xx <- supportedUCSCtables()
>> xx[grep("GENCODE Genes", xx$track), ]
>>                                               track subtrack
>> wgEncodeGencodeBasicV17          GENCODE Genes V17     <NA>
>> wgEncodeGencodeCompV17           GENCODE Genes V17     <NA>
>> wgEncodeGencodePseudoGeneV17     GENCODE Genes V17     <NA>
>> wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17     <NA>
>> wgEncodeGencodePolyaV17          GENCODE Genes V17     <NA>
>> wgEncodeGencodeBasicV14          GENCODE Genes V14     <NA>
>> wgEncodeGencodeCompV14           GENCODE Genes V14     <NA>
>> wgEncodeGencodePseudoGeneV14     GENCODE Genes V14     <NA>
>> wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14     <NA>
>> wgEncodeGencodePolyaV14          GENCODE Genes V14     <NA>
>> wgEncodeGencodeBasicV7            GENCODE Genes V7     <NA>
>> wgEncodeGencodeCompV7             GENCODE Genes V7     <NA>
>> wgEncodeGencodePseudoGeneV7       GENCODE Genes V7     <NA>
>> wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7     <NA>
>> wgEncodeGencodePolyaV7            GENCODE Genes V7     <NA>
>>
>> which is about 2 years old. current Gencode gene annotations are V24 and
>> at least V22 was available at:
>>
>> http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
>>
>> before the last BioC release.
>>
>> according to a recent announcement at the BioC support site:
>>
>> https://support.bioconductor.org/p/71574
>>
>> AnnotationHub seems to be now the proper way to import the most recent
>> Gencode annotations into BioC. however, at least in my hands, making the
>> corresponding TxDb object produces an error; see the following example:
>>
>> library(AnnotationHub)
>>
>> ah <- AnnotationHub()
>> human_gff <- query(ah, c("Gencode", "gff", "human"))
>>
>> gencodeV23basicGFF <- ah[["AH49556"]]
>> metadata <- data.frame(name=c("Data source", "Genome", "Organism",
>>                                "Resource URL", "Full dataset"),
>>                         value=c(ah["AH49556"]$dataprovider,
>> ah["AH49556"]$genome,
>>                                 ah["AH49556"]$species,
>> ah["AH49556"]$sourceurl, "no"))
>> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
>> Error in .merge_transcript_parts(transcripts) :
>>    The following transcripts have multiple parts that cannot be merged
>> because of incompatible seqnames: ENST00000244174.9,
>>
>
> should this be an error, or would a softer landing be more useful here?
>   warn and exclude the offensive elements, perhaps with an option
> to retrieve them through some special step (option or new function)?

This was actually a bug in makeTxDbFromGRanges(). It's fixed in
GenomicFeatures 1.22.8 (release) and 1.23.17 (devel). With this
fix:

 > txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
 > txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: Gencode
# Genome: GRCh38
# Organism: Homo sapiens
# Resource URL: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.basic.annotation.gff3.gz
# Full dataset: no
# transcript_nrow: 100769
# exon_nrow: 676601
# cds_nrow: 535301
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-01-11 22:28:51 -0800 (Mon, 11 Jan 2016)
# GenomicFeatures version at creation time: 1.23.17
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1

 > transcripts(txdb)
GRanges object with 100769 ranges and 2 metadata columns:
            seqnames         ranges strand   |     tx_id           tx_name
               <Rle>      <IRanges>  <Rle>   | <integer>       <character>
        [1]     chr1 [11869, 14409]      +   |         1 ENST00000456328.2
        [2]     chr1 [12010, 13670]      +   |         2 ENST00000450305.2
        [3]     chr1 [29554, 31097]      +   |         3 ENST00000473358.1
        [4]     chr1 [30267, 31109]      +   |         4 ENST00000469289.1
        [5]     chr1 [30366, 30503]      +   |         5 ENST00000607096.1
        ...      ...            ...    ... ...       ...               ...
   [100765]     chrM [ 5826,  5891]      -   |    100765 ENST00000387409.1
   [100766]     chrM [ 7446,  7514]      -   |    100766 ENST00000387416.2
   [100767]     chrM [14149, 14673]      -   |    100767 ENST00000361681.2
   [100768]     chrM [14674, 14742]      -   |    100768 ENST00000387459.1
   [100769]     chrM [15956, 16023]      -   |    100769 ENST00000387461.2
   -------
   seqinfo: 25 sequences from GRCh38 genome; no seqlengths
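
To check whether an installed GenomicFeatures already carries the fix, something like the sketch below should do. Note that has_merge_fix() is a hypothetical helper name, not something in GenomicFeatures, and it assumes that 1.22.x is the release branch and 1.23.x the devel branch, as is the case at the time of writing. A plain ">= 1.22.8" test would not be enough, because devel versions 1.23.0 to 1.23.16 sort above 1.22.8 but predate the fix.

```r
# Hypothetical helper (not part of GenomicFeatures): TRUE if 'version'
# is at least 1.22.8 on the release branch or 1.23.17 on devel/later.
has_merge_fix <- function(version) {
  v <- package_version(as.character(version))
  if (v >= "1.23.0") v >= "1.23.17" else v >= "1.22.8"
}

# Only query the installed version if the package is actually available.
if (requireNamespace("GenomicFeatures", quietly = TRUE)) {
  print(has_merge_fix(packageVersion("GenomicFeatures")))
}
```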

H.

>
>
>>    ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
>> ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
>>    ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
>> ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
>>    ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
>> ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
>>    ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
>> ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
>>    ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
>> ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
>>    ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
>> ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
>>    ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
>> ENST00000391707.6, ENST00000399012.5, ENST00000399966.8,
>>    ENST00000400841.6, ENST00000411342.5, ENST00000412936
>>
>>
>> on top of this, even if it would work, these annotations are anchored at
>> Ensembl Gene identifiers while the gene-centric annotations at org.Hs.eg.db
>
>
> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
> as a symbol mapping resource)?  Both ENTREZID and ENSEMBL are listed as
> keytypes.  My question is whether this "anchor" concept
> holds in the current infrastructure.
>
>> are anchored at Entrez Gene identifiers. this means that more code would
>> have to be involved to add the corresponding Entrez IDs (resolving
>> multiplicities, etc.) and produce a TxDb package that can be used across
>> many of the typical BioC pipelines.
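
For what it's worth, the extra code for adding Entrez IDs might look roughly like the sketch below. It assumes that the Gencode gene IDs carry a version suffix that must be dropped before lookup, since the ENSEMBL keys in org.Hs.eg.db are unversioned. strip_version() and ensembl_to_entrez() are hypothetical helper names, and the version suffixes in the example IDs are only illustrative. The multiplicities are handled here in the crudest way (multiVals="first"); "filter" or "list" are the alternatives offered by mapIds() if one-to-many mappings should be dropped or kept.

```r
# Drop the version suffix from Gencode gene IDs,
# e.g. "ENSG00000141510.16" becomes "ENSG00000141510".
strip_version <- function(ids) sub("\\..*$", "", ids)

# Hypothetical helper: map Gencode gene IDs to Entrez IDs via org.Hs.eg.db.
# Returns a named vector (names are the unversioned Ensembl gene IDs).
ensembl_to_entrez <- function(gencode_ids, multiVals = "first") {
  AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                        keys = unique(strip_version(gencode_ids)),
                        column = "ENTREZID", keytype = "ENSEMBL",
                        multiVals = multiVals)
}

# Illustrative IDs; the version suffixes are not meant to be exact.
ids <- c("ENSG00000141510.16", "ENSG00000223972.5")
print(strip_version(ids))

# The lookup itself only runs if the annotation package is installed.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  print(ensembl_to_entrez(ids))
}
```
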
>>
>> since human gene annotations are at the core of many BioC pipelines, i'd
>> like to suggest for the forthcoming release cycles, that the BioC core team
>> packages Gencode annotations anchored at Entrez IDs, at least what is
>> called the "basic set", similarly to what is done with
>> TxDb.Hsapiens.UCSC.knownGene to have an easy starting point for the
>> analysis of human data.
>>
>>
>> cheers,
>>
>> robert.
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319