[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
Vincent Carey
stvjc at channing.harvard.edu
Mon Jan 11 16:07:31 CET 2016
On Mon, Jan 11, 2016 at 9:29 AM, Robert Castelo <robert.castelo at upf.edu>
wrote:
> hi,
>
> if i'm interpreting this correctly, the news archive of the UCSC Genome
> Browser accessible here:
>
> http://genome.ucsc.edu/goldenPath/newsarch.html
>
> announced on June 29th, 2015, that they are discontinuing the generation
> of UCSC Known Genes annotations for human and providing the Gencode
> annotations as the default replacement.
>
> the BioC site provides as default gene annotations for human the UCSC
> Known Genes track and currently does not provide the Gencode annotations.
>
> the GenomicFeatures package allows one to build such an annotation
> package. unfortunately, the "supported" UCSC tables that can be
> easily used via 'makeTxDbPackageFromUCSC()' currently only go up to
> Gencode version V17:
>
> library(GenomicFeatures)
>
> xx <- supportedUCSCtables()
> xx[grep("GENCODE Genes", xx$track), ]
>                                              track subtrack
> wgEncodeGencodeBasicV17          GENCODE Genes V17     <NA>
> wgEncodeGencodeCompV17           GENCODE Genes V17     <NA>
> wgEncodeGencodePseudoGeneV17     GENCODE Genes V17     <NA>
> wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17     <NA>
> wgEncodeGencodePolyaV17          GENCODE Genes V17     <NA>
> wgEncodeGencodeBasicV14          GENCODE Genes V14     <NA>
> wgEncodeGencodeCompV14           GENCODE Genes V14     <NA>
> wgEncodeGencodePseudoGeneV14     GENCODE Genes V14     <NA>
> wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14     <NA>
> wgEncodeGencodePolyaV14          GENCODE Genes V14     <NA>
> wgEncodeGencodeBasicV7            GENCODE Genes V7     <NA>
> wgEncodeGencodeCompV7             GENCODE Genes V7     <NA>
> wgEncodeGencodePseudoGeneV7       GENCODE Genes V7     <NA>
> wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7     <NA>
> wgEncodeGencodePolyaV7            GENCODE Genes V7     <NA>
>
> which is about 2 years old. current Gencode gene annotations are V24 and
> at least V22 was available at:
>
> http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
>
> before the last BioC release.
>
> according to a recent announcement at the BioC support site:
>
> https://support.bioconductor.org/p/71574
>
> AnnotationHub seems to be now the proper way to import the most recent
> Gencode annotations into BioC. however, at least in my hands, making the
> corresponding TxDb object produces an error; see the following example:
>
> library(AnnotationHub)
>
> ah <- AnnotationHub()
> human_gff <- query(ah, c("Gencode", "gff", "human"))
>
> gencodeV23basicGFF <- ah[["AH49556"]]
> metadata <- data.frame(name=c("Data source", "Genome", "Organism",
>                               "Resource URL", "Full dataset"),
>                        value=c(ah["AH49556"]$dataprovider,
>                                ah["AH49556"]$genome,
>                                ah["AH49556"]$species,
>                                ah["AH49556"]$sourceurl, "no"))
> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
> Error in .merge_transcript_parts(transcripts) :
> The following transcripts have multiple parts that cannot be merged
> because of incompatible seqnames: ENST00000244174.9,
>
Should this be an error, or would a softer landing be more useful here?
Warn and exclude the offending elements, perhaps with an option
to retrieve them through some special step (option or new function)?
> ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
> ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
> ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
> ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
> ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
> ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
> ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
> ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
> ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
> ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
> ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
> ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
> ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
> ENST00000391707.6, ENST00000399012.5, ENST00000399966.8,
> ENST00000400841.6, ENST00000411342.5, ENST00000412936
>
>
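
The "warn and exclude" idea raised in the reply above could look roughly
like the sketch below. It is only an illustration under stated assumptions:
splitMultiSeqnameTx() is a hypothetical helper, not a GenomicFeatures
function, it assumes the GRanges carries a 'transcript_id' metadata column,
and the toy ranges and IDs are made up.

```r
library(GenomicRanges)

# Hypothetical helper (not part of GenomicFeatures): warn about transcripts
# whose parts sit on more than one seqname, and separate them from the rest.
# Assumes a 'transcript_id' metadata column on the input GRanges.
splitMultiSeqnameTx <- function(gr) {
  stopifnot("transcript_id" %in% names(mcols(gr)))
  tx <- as.character(mcols(gr)$transcript_id)
  chr <- as.character(seqnames(gr))
  has_tx <- !is.na(tx)
  # number of distinct seqnames seen for each transcript ID
  n_chr <- tapply(chr[has_tx], tx[has_tx], function(x) length(unique(x)))
  bad <- names(n_chr)[n_chr > 1L]
  if (length(bad) > 0L)
    warning("excluding ", length(bad),
            " transcript(s) found on several seqnames: ",
            paste(bad, collapse=", "))
  # keep the excluded IDs so the caller can retrieve them in a later step
  list(kept=gr[!(has_tx & tx %in% bad)], dropped=bad)
}

# Toy input: "txA" has parts on two seqnames, "txB" on one.
toy <- GRanges(seqnames=c("chrX", "chrY", "chr1", "chr1"),
               ranges=IRanges(start=c(100, 100, 500, 700), width=50),
               strand="+",
               transcript_id=c("txA", "txA", "txB", "txB"))
res <- suppressWarnings(splitMultiSeqnameTx(toy))
```

The 'kept' element could then be handed to makeTxDbFromGRanges(), while
'dropped' records what was left out. Whether a real GFF3 import exposes
transcript IDs under that column name would need checking first.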
> on top of this, even if it would work, these annotations are anchored at
> Ensembl Gene identifiers while the gene-centric annotations at org.Hs.eg.db
Is it true that there is an asymmetry between Entrez gene ID and Ensembl
gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
as a symbol mapping resource)? Both ENTREZID and ENSEMBL are listed as
keytypes. My question is whether this "anchor" concept
holds in the current infrastructure.
> are anchored at Entrez Gene identifiers. this means that more code would
> have to be involved to add the corresponding Entrez IDs (resolving
> multiplicities, etc.) and produce a TxDb package that can be used across
> many of the typical BioC pipelines.
>
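
For the Ensembl-to-Entrez step discussed above, a minimal sketch of the
lookup through org.Hs.eg.db follows. The two gene IDs are examples chosen
for illustration (their version suffixes are arbitrary), stripVersion() is
a hypothetical helper, and taking the "first" match is just one possible
policy for resolving multiplicities, not a recommendation.

```r
library(org.Hs.eg.db)  # also attaches AnnotationDbi

# Hypothetical helper: Gencode IDs carry a version suffix (e.g. ".9"),
# whereas the ENSEMBL keys in org.Hs.eg.db are unversioned.
stripVersion <- function(ids) sub("\\.[0-9]+$", "", ids)

# Illustrative IDs; the version suffixes are arbitrary.
gencode_ids <- c("ENSG00000141510.16", "ENSG00000012048.20")
ens <- stripVersion(gencode_ids)

# Both "ENSEMBL" and "ENTREZID" should appear among the key types.
keytypes(org.Hs.eg.db)

# Full mapping table; one Ensembl ID may map to several Entrez IDs.
tab <- AnnotationDbi::select(org.Hs.eg.db, keys=ens, keytype="ENSEMBL",
                             columns=c("ENTREZID", "SYMBOL"))

# One way to resolve multiplicities: keep the first match per key.
entrez <- mapIds(org.Hs.eg.db, keys=ens, keytype="ENSEMBL",
                 column="ENTREZID", multiVals="first")
```

Other values of 'multiVals', such as "list" or "filter", keep or discard
ambiguous keys instead of silently picking one.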
> since human gene annotations are at the core of many BioC pipelines, i'd
> like to suggest for the forthcoming release cycles, that the BioC core team
> packages Gencode annotations anchored at Entrez IDs, at least what is
> called the "basic set", similarly to what is done with
> TxDb.Hsapiens.UCSC.knownGene to have an easy starting point for the
> analysis of human data.
>
>
> cheers,
>
> robert.
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>