[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
Hervé Pagès
hpages at fredhutch.org
Tue Jan 12 07:53:47 CET 2016
Hi Vince, Robert,
On 01/11/2016 07:07 AM, Vincent Carey wrote:
> On Mon, Jan 11, 2016 at 9:29 AM, Robert Castelo <robert.castelo at upf.edu>
> wrote:
>
>> hi,
>>
>> if i'm interpreting this correctly, the news archive of the UCSC Genome
>> Browser accessible here:
>>
>> http://genome.ucsc.edu/goldenPath/newsarch.html
>>
>> announced on June 29th, 2015, that they are discontinuing the generation
>> of UCSC Known Genes annotations for human, and providing the Gencode
>> annotations as the default replacement.
>>
>> the BioC site provides as default gene annotations for human the UCSC
>> Known Genes track and currently does not provide the Gencode annotations.
>>
>> the GenomicFeatures package allows one to build such an annotation
>> package. unfortunately the current "supported" UCSC tables that can be
>> easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode version
>> V17:
>>
>> library(GenomicFeatures)
>>
>> xx <- supportedUCSCtables()
>> xx[grep("GENCODE Genes", xx$track), ]
>>                                              track subtrack
>> wgEncodeGencodeBasicV17          GENCODE Genes V17     <NA>
>> wgEncodeGencodeCompV17           GENCODE Genes V17     <NA>
>> wgEncodeGencodePseudoGeneV17     GENCODE Genes V17     <NA>
>> wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17     <NA>
>> wgEncodeGencodePolyaV17          GENCODE Genes V17     <NA>
>> wgEncodeGencodeBasicV14          GENCODE Genes V14     <NA>
>> wgEncodeGencodeCompV14           GENCODE Genes V14     <NA>
>> wgEncodeGencodePseudoGeneV14     GENCODE Genes V14     <NA>
>> wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14     <NA>
>> wgEncodeGencodePolyaV14          GENCODE Genes V14     <NA>
>> wgEncodeGencodeBasicV7            GENCODE Genes V7     <NA>
>> wgEncodeGencodeCompV7             GENCODE Genes V7     <NA>
>> wgEncodeGencodePseudoGeneV7       GENCODE Genes V7     <NA>
>> wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7     <NA>
>> wgEncodeGencodePolyaV7            GENCODE Genes V7     <NA>
>>
>> which is about 2 years old. current Gencode gene annotations are V24 and
>> at least V22 was available at:
>>
>> http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
>>
>> before the last BioC release.
>>
>> according to a recent announcement at the BioC support site:
>>
>> https://support.bioconductor.org/p/71574
>>
>> AnnotationHub seems to be now the proper way to import the most recent
>> Gencode annotations into BioC. however, at least in my hands, making the
>> corresponding TxDb object produces an error; see the following example:
>>
>> library(AnnotationHub)
>>
>> ah <- AnnotationHub()
>> human_gff <- query(ah, c("Gencode", "gff", "human"))
>>
>> gencodeV23basicGFF <- ah[["AH49556"]]
>> metadata <- data.frame(name=c("Data source", "Genome", "Organism",
>> "Resource URL", "Full dataset"),
>> value=c(ah["AH49556"]$dataprovider,
>> ah["AH49556"]$genome,
>> ah["AH49556"]$species,
>> ah["AH49556"]$sourceurl, "no"))
>> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
>> Error in .merge_transcript_parts(transcripts) :
>> The following transcripts have multiple parts that cannot be merged
>> because of incompatible seqnames: ENST00000244174.9,
>>
>
> should this be an error, or would a softer landing be more useful here?
> warn and exclude the offending elements, perhaps with an option
> to retrieve them through some special step (option or new function)?
This was actually a bug in makeTxDbFromGRanges(). It's fixed in
GenomicFeatures 1.22.8 (release) and 1.23.17 (devel). With this
fix:
> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: Gencode
# Genome: GRCh38
# Organism: Homo sapiens
# Resource URL: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.basic.annotation.gff3.gz
# Full dataset: no
# transcript_nrow: 100769
# exon_nrow: 676601
# cds_nrow: 535301
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2016-01-11 22:28:51 -0800 (Mon, 11 Jan 2016)
# GenomicFeatures version at creation time: 1.23.17
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1
> transcripts(txdb)
GRanges object with 100769 ranges and 2 metadata columns:
           seqnames         ranges strand |     tx_id           tx_name
              <Rle>      <IRanges>  <Rle> | <integer>       <character>
       [1]     chr1 [11869, 14409]      + |         1 ENST00000456328.2
       [2]     chr1 [12010, 13670]      + |         2 ENST00000450305.2
       [3]     chr1 [29554, 31097]      + |         3 ENST00000473358.1
       [4]     chr1 [30267, 31109]      + |         4 ENST00000469289.1
       [5]     chr1 [30366, 30503]      + |         5 ENST00000607096.1
       ...      ...            ...    ... ...     ...               ...
  [100765]     chrM [ 5826,  5891]      - |    100765 ENST00000387409.1
  [100766]     chrM [ 7446,  7514]      - |    100766 ENST00000387416.2
  [100767]     chrM [14149, 14673]      - |    100767 ENST00000361681.2
  [100768]     chrM [14674, 14742]      - |    100768 ENST00000387459.1
  [100769]     chrM [15956, 16023]      - |    100769 ENST00000387461.2
  -------
  seqinfo: 25 sequences from GRCh38 genome; no seqlengths
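
To check whether a given GenomicFeatures version already carries the fix, something like the following sketch might help. It only uses base R and the two version numbers given above (1.22.8 for release, 1.23.17 for devel). The helper name has_merge_fix() is hypothetical, not part of GenomicFeatures, and the sketch assumes that anything newer than 1.23.17 also contains the fix.

```r
# Hypothetical helper (not part of GenomicFeatures): TRUE if 'version' is at
# least 1.22.8 on the release branch (1.22.x) or at least 1.23.17 on the
# devel branch (1.23.x) or later.
has_merge_fix <- function(version) {
    v <- as.package_version(version)
    if (v < "1.23.0") {
        # release branch (1.22.x) or anything older
        v >= "1.22.8"
    } else {
        # devel branch (1.23.x); assumed to hold for later versions too
        v >= "1.23.17"
    }
}

has_merge_fix("1.22.7")   # release, before the fix
has_merge_fix("1.22.8")   # release, with the fix
has_merge_fix("1.23.5")   # devel, before the fix
has_merge_fix("1.23.17")  # devel, with the fix
```

In a live session one would pass packageVersion("GenomicFeatures") instead of a literal string.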
H.
>
>
>> ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
>> ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
>> ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
>> ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
>> ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
>> ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
>> ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
>> ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
>> ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
>> ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
>> ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
>> ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
>> ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
>> ENST00000391707.6, ENST00000399012.5, ENST00000399966.8,
>> ENST00000400841.6, ENST00000411342.5, ENST00000412936
>>
>>
>> on top of this, even if it would work, these annotations are anchored at
>> Ensembl Gene identifiers while the gene-centric annotations at org.Hs.eg.db
>
>
> Is it true that there is an asymmetry between Entrez gene ID and Ensembl
> gene ID for querying org.Hs.eg.db (I tend to prefer Homo.sapiens
> as a symbol mapping resource)? Both ENTREZID and ENSEMBL are listed as
> keytypes. My question is whether this "anchor" concept
> holds in the current infrastructure.
>
>> are anchored at Entrez Gene identifiers. this means that more code would
>> have to be involved to add the corresponding Entrez IDs (resolving
>> multiplicities, etc.) and produce a TxDb package that can be used across
>> many of the typical BioC pipelines.
>>
>> since human gene annotations are at the core of many BioC pipelines, i'd
>> like to suggest for the forthcoming release cycles, that the BioC core team
>> packages Gencode annotations anchored at Entrez IDs, at least what is
>> called the "basic set", similarly to what is done with
>> TxDb.Hsapiens.UCSC.knownGene to have an easy starting point for the
>> analysis of human data.
>>
>>
>> cheers,
>>
>> robert.
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319