[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
Robert Castelo
robert.castelo at upf.edu
Mon Jan 11 15:29:07 CET 2016
hi,
if i'm interpreting this correctly, the news archive of the UCSC Genome
Browser accessible here:
http://genome.ucsc.edu/goldenPath/newsarch.html
announced on June 29th, 2015, that they are discontinuing the generation
of UCSC Known Genes annotations for human, and providing the Gencode
annotations as the default replacement.
the BioC site provides as default gene annotations for human the UCSC
Known Genes track and currently does not provide the Gencode annotations.
the GenomicFeatures package allows one to build such an annotation
package. unfortunately, the currently "supported" UCSC tables that can be
easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode
version V17:
library(GenomicFeatures)
xx <- supportedUCSCtables()
xx[grep("GENCODE Genes", xx$track), ]
                                             track subtrack
wgEncodeGencodeBasicV17          GENCODE Genes V17     <NA>
wgEncodeGencodeCompV17           GENCODE Genes V17     <NA>
wgEncodeGencodePseudoGeneV17     GENCODE Genes V17     <NA>
wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17     <NA>
wgEncodeGencodePolyaV17          GENCODE Genes V17     <NA>
wgEncodeGencodeBasicV14          GENCODE Genes V14     <NA>
wgEncodeGencodeCompV14           GENCODE Genes V14     <NA>
wgEncodeGencodePseudoGeneV14     GENCODE Genes V14     <NA>
wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14     <NA>
wgEncodeGencodePolyaV14          GENCODE Genes V14     <NA>
wgEncodeGencodeBasicV7            GENCODE Genes V7     <NA>
wgEncodeGencodeCompV7             GENCODE Genes V7     <NA>
wgEncodeGencodePseudoGeneV7       GENCODE Genes V7     <NA>
wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7     <NA>
wgEncodeGencodePolyaV7            GENCODE Genes V7     <NA>
which is about 2 years old. current Gencode gene annotations are V24 and
at least V22 was available at:
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
before the last BioC release.
according to a recent announcement at the BioC support site:
https://support.bioconductor.org/p/71574
AnnotationHub seems to be now the proper way to import the most recent
Gencode annotations into BioC. however, at least in my hands, making the
corresponding TxDb object produces an error; see the following example:
library(AnnotationHub)
ah <- AnnotationHub()
human_gff <- query(ah, c("Gencode", "gff", "human"))
gencodeV23basicGFF <- ah[["AH49556"]]
metadata <- data.frame(name=c("Data source", "Genome", "Organism",
                              "Resource URL", "Full dataset"),
                       value=c(ah["AH49556"]$dataprovider,
                               ah["AH49556"]$genome,
                               ah["AH49556"]$species,
                               ah["AH49556"]$sourceurl, "no"))
txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
Error in .merge_transcript_parts(transcripts) :
The following transcripts have multiple parts that cannot be merged
because of incompatible seqnames: ENST00000244174.9,
ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
ENST00000391707.6, ENST00000399012.5, ENST00000399966.8,
ENST00000400841.6, ENST00000411342.5, ENST00000412936
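the error message says that records sharing one transcript ID sit on more than one seqname. a minimal sketch of how such transcripts could be spotted and set aside is shown below. it runs on a toy data.frame with made-up values (the column names 'seqnames' and 'transcript_id' and the IDs 'tx1'/'tx2' are assumptions for illustration). on the real GRanges object the two vectors would presumably come from seqnames() and the transcript_id metadata column. whether dropping these records is acceptable depends on the analysis.

```r
# toy stand-in for the GFF records (hypothetical values, not real Gencode data)
parts <- data.frame(
  seqnames      = c("chrX", "chrX", "chrY", "chr1", "chr1"),
  transcript_id = c("tx1",  "tx1",  "tx1",  "tx2",  "tx2"),
  stringsAsFactors = FALSE
)

# number of distinct seqnames on which each transcript ID appears
n_seq <- tapply(parts$seqnames, parts$transcript_id,
                function(x) length(unique(x)))

# transcript IDs whose parts are located on more than one sequence
multi <- names(n_seq)[n_seq > 1]

# keep only the records of transcripts confined to a single sequence
clean <- parts[!parts$transcript_id %in% multi, ]
```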
on top of this, even if it worked, these annotations are anchored at
Ensembl Gene identifiers while the gene-centric annotations at
org.Hs.eg.db are anchored at Entrez Gene identifiers. this means that
more code would have to be involved to add the corresponding Entrez IDs
(resolving multiplicities, etc.) and produce a TxDb package that can be
used across many of the typical BioC pipelines.
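to illustrate what "resolving multiplicities" could involve, here is a minimal sketch in base R. the IDs and the mapping table are made up. in practice the Ensembl-to-Entrez table might be obtained from org.Hs.eg.db, e.g. with select() using keytype "ENSEMBL" and column "ENTREZID". the policy used here (keep only Ensembl IDs that map to exactly one Entrez ID) is just one possible choice, not an established BioC convention.

```r
# Gencode-style gene IDs carry a version suffix (hypothetical examples)
gencode_ids <- c("ENSG00000000001.5", "ENSG00000000002.3", "ENSG00000000003.1")

# toy Ensembl-to-Entrez table: one ambiguous Ensembl ID, one ID not mapped at all
map <- data.frame(
  ENSEMBL  = c("ENSG00000000001", "ENSG00000000002", "ENSG00000000002"),
  ENTREZID = c("101", "202", "203"),
  stringsAsFactors = FALSE
)

# strip the version suffix so the IDs match the mapping table
ensembl <- sub("\\.[0-9]+$", "", gencode_ids)

# keep only Ensembl IDs that map to exactly one Entrez ID
n_hits     <- table(map$ENSEMBL)
unique_ens <- names(n_hits)[n_hits == 1]
one2one    <- map[map$ENSEMBL %in% unique_ens, ]

# look up each Gencode gene; ambiguous or unmapped genes become NA
entrez <- one2one$ENTREZID[match(ensembl, one2one$ENSEMBL)]
names(entrez) <- gencode_ids
```

the reverse multiplicity (one Entrez ID shared by several Ensembl IDs) is not handled here and would need a similar filter.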
since human gene annotations are at the core of many BioC pipelines, i'd
like to suggest that, for the forthcoming release cycles, the BioC core
team package Gencode annotations anchored at Entrez IDs, at least what
is called the "basic set", similarly to what is done with
TxDb.Hsapiens.UCSC.knownGene, to provide an easy starting point for the
analysis of human data.
cheers,
robert.