[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

Mon Jan 11 23:36:01 CET 2016

that looks great, thanks Hervé for addressing this quickly.

robert.

On 1/11/16 11:18 PM, Hervé Pagès wrote:
> With GenomicFeatures 1.23.16:
>
> > txdb <- makeTxDbFromUCSC("hg38", "knownGene")
> Download the knownGene table ... OK
> Download the knownToLocusLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TxDb object ... OK
> Warning message:
> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
>   UCSC data anomaly in 19942 transcript(s): the cds cumulative length is
>   not a multiple of 3 for transcripts ‘uc057axw.1’ ‘uc031pko.2’
>   ‘uc057ayh.1’ ‘uc057ayr.1’ ‘uc057azl.1’ ‘uc057azn.1’ ‘uc057azp.1’
>   ‘uc057azt.1’ ‘uc057azy.1’ ‘uc057bad.1’ ‘uc057bap.1’ ‘uc057bav.1’
>   ‘uc057bay.1’ ‘uc057bbh.1’ ‘uc057bbv.1’ ‘uc057bcm.1’ ‘uc057bcp.1’
>   ‘uc057bcw.1’ ‘uc057bcx.1’ ‘uc057bdf.1’ ‘uc057bdj.1’ ‘uc057bdk.1’
>   ‘uc057bdl.1’ ‘uc057bdp.1’ ‘uc057beb.1’ ‘uc057beq.1’ ‘uc057bfn.1’
>   ‘uc057bfs.1’ ‘uc057bgm.1’ ‘uc057bgo.1’ ‘uc057bgt.1’ ‘uc057bhd.1’
>   ‘uc057bhi.1’ ‘uc057bio.1’ ‘uc057biu.1’ ‘uc057bji.1’ ‘uc057bjj.1’
>   ‘uc057bjk.1’ ‘uc057bkd.1’ ‘uc057bkf.1’ ‘uc057bkj.1’ ‘uc057bkl.1’
>   ‘uc057bkt.1’ ‘uc057bld.1’ ‘uc057bli.1’ ‘uc057blj.1’ ‘uc057blz.1’
>   ‘uc057bmf.1’ ‘uc057bmr.1’ ‘uc057bmv.1’ ‘uc057bni.1’ ‘u [... truncated]
>
> > txdb
> TxDb object:
> # Db type: TxDb
> # Supporting package: GenomicFeatures
> # Data source: UCSC
> # Genome: hg38
> # Organism: Homo sapiens
> # Taxonomy ID: 9606
> # UCSC Table: knownGene
> # UCSC Track: GENCODE v22
> # Resource URL: http://genome.ucsc.edu/
> # Type of Gene ID: Entrez Gene ID
> # Full dataset: yes
> # miRBase build ID: NA
> # transcript_nrow: 195178
> # exon_nrow: 575044
> # cds_nrow: 291225
> # Db created by: GenomicFeatures package from Bioconductor
> # Creation time: 2016-01-11 14:10:24 -0800 (Mon, 11 Jan 2016)
> # GenomicFeatures version at creation time: 1.23.16
> # RSQLite version at creation time: 1.0.0
> # DBSCHEMAVERSION: 1.1
>
> Note the new "UCSC Track" field above.
>
> Cheers,
> H.
>
>
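The warning quoted above reports transcripts whose cumulative CDS length is not a multiple of 3. As a rough illustration of what that check amounts to, here is a minimal base-R sketch on made-up data. The transcript names and coordinates are hypothetical, and this is not the code GenomicFeatures runs internally. On a real TxDb the per-transcript CDS parts could presumably be obtained with `cdsBy(txdb, by="tx", use.names=TRUE)`.

```r
# Toy CDS parts, one row per CDS segment (all values are invented).
cds <- data.frame(
  tx_name = c("txA", "txA", "txB", "txB", "txC"),
  start   = c(101L, 301L, 1001L, 1201L, 5001L),
  end     = c(190L, 360L, 1100L, 1249L, 5010L),
  stringsAsFactors = FALSE
)

# Width of each CDS segment (coordinates are 1-based, inclusive).
cds$width <- cds$end - cds$start + 1L

# Cumulative CDS length per transcript.
cum_len <- vapply(split(cds$width, cds$tx_name), sum, integer(1))

# Transcripts whose cumulative CDS length is not a multiple of 3.
anomalous <- names(cum_len)[cum_len %% 3L != 0L]
print(cum_len)
print(anomalous)
```

Whether such transcripts should be dropped or kept depends on the downstream analysis; the sketch only flags them.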
> On 01/11/2016 01:12 PM, Hervé Pagès wrote:
>> Hi Robert and others,
>>
>> I looked at this and the new situation doesn't seem as disruptive as
>> it sounds. The bulk of the data for both tracks (i.e. the "UCSC Genes"
>> track for hg19 and the "GENCODE v22" track for hg38) is stored in the
>> knownGene table.
>>
>> The hg19.knownGene table is described here:
>>
>> https://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema
>>
>> The hg38.knownGene table is described here:
>>
>> https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene&hgta_doSchema=describe+table+schema
>>
>> The 2 pages are very similar. In particular both tables are connected
>> to the knownToLocusLink table where Entrez Gene IDs are stored.
>>
>> So from a makeTxDbFromUCSC() point of view everything looks the same
>> except the name of the track: "UCSC Genes" for hg19 and "GENCODE v22"
>> for hg38. That means it shouldn't be hard to tweak makeTxDbFromUCSC()
>> to support:
>>
>>      txdb <- makeTxDbFromUCSC("hg38", "knownGene")
>>
>> The returned 'txdb' will contain data from the "GENCODE v22" track
>> and with transcripts mapped to Entrez Gene IDs.
>>
>> I'll work on this and will also investigate makeTxDbFromGRanges's
>> failure on AnnotationHub's GFF files from GENCODE.
>>
>> H.
>>
>>
>> On 01/11/2016 06:29 AM, Robert Castelo wrote:
>>> hi,
>>>
>>> if i'm interpreting this correctly, the news archive of the UCSC Genome
>>> Browser accessible here:
>>>
>>>   http://genome.ucsc.edu/goldenPath/newsarch.html
>>>
>>> announced on June 29th, 2015, that they are discontinuing the generation
>>> of UCSC Known Genes annotations for human and providing the Gencode
>>> annotations as the default replacement.
>>>
>>> the BioC site provides as default gene annotations for human the UCSC
>>> Known Genes track and currently does not provide the Gencode annotations.
>>>
>>> the GenomicFeatures package allows one to build such an annotation
>>> package. unfortunately the currently "supported" UCSC tables that can be
>>> easily used via 'makeTxDbPackageFromUCSC()' only go up to Gencode
>>> version V17:
>>>
>>> library(GenomicFeatures)
>>>
>>> xx <- supportedUCSCtables()
>>> xx[grep("GENCODE Genes", xx$track), ]
>>>                                               track subtrack
>>> wgEncodeGencodeBasicV17          GENCODE Genes V17 <NA>
>>> wgEncodeGencodeCompV17           GENCODE Genes V17 <NA>
>>> wgEncodeGencodePseudoGeneV17     GENCODE Genes V17 <NA>
>>> wgEncodeGencode2wayConsPseudoV17 GENCODE Genes V17 <NA>
>>> wgEncodeGencodePolyaV17          GENCODE Genes V17 <NA>
>>> wgEncodeGencodeBasicV14          GENCODE Genes V14 <NA>
>>> wgEncodeGencodeCompV14           GENCODE Genes V14 <NA>
>>> wgEncodeGencodePseudoGeneV14     GENCODE Genes V14 <NA>
>>> wgEncodeGencode2wayConsPseudoV14 GENCODE Genes V14 <NA>
>>> wgEncodeGencodePolyaV14          GENCODE Genes V14 <NA>
>>> wgEncodeGencodeBasicV7            GENCODE Genes V7 <NA>
>>> wgEncodeGencodeCompV7             GENCODE Genes V7 <NA>
>>> wgEncodeGencodePseudoGeneV7       GENCODE Genes V7 <NA>
>>> wgEncodeGencode2wayConsPseudoV7   GENCODE Genes V7 <NA>
>>> wgEncodeGencodePolyaV7            GENCODE Genes V7 <NA>
>>>
>>> which is about 2 years old. current Gencode gene annotations are V24 and
>>> at least V22 was available at:
>>>
>>> http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database
>>>
>>> before the last BioC release.
>>>
>>> according to a recent announcement at the BioC support site:
>>>
>>> https://support.bioconductor.org/p/71574
>>>
>>> AnnotationHub seems to be now the proper way to import the most recent
>>> Gencode annotations into BioC. however, at least in my hands, making the
>>> corresponding TxDb object produces an error; see the following example:
>>>
>>> library(AnnotationHub)
>>>
>>> ah <- AnnotationHub()
>>> human_gff <- query(ah, c("Gencode", "gff", "human"))
>>>
>>> gencodeV23basicGFF <- ah[["AH49556"]]
>>> metadata <- data.frame(name=c("Data source", "Genome", "Organism",
>>>                               "Resource URL", "Full dataset"),
>>>                        value=c(ah["AH49556"]$dataprovider,
>>>                                ah["AH49556"]$genome,
>>>                                ah["AH49556"]$species,
>>>                                ah["AH49556"]$sourceurl, "no"))
>>> txdb <- makeTxDbFromGRanges(gencodeV23basicGFF, metadata=metadata)
>>> Error in .merge_transcript_parts(transcripts) :
>>>    The following transcripts have multiple parts that cannot be merged
>>> because of incompatible seqnames: ENST00000244174.9,
>>>    ENST00000262640.10, ENST00000286448.10, ENST00000302805.6,
>>> ENST00000313871.7, ENST00000326153.8, ENST00000331035.8,
>>>    ENST00000334060.7, ENST00000334651.9, ENST00000355432.7,
>>> ENST00000355805.6, ENST00000359512.7, ENST00000369423.6,
>>>    ENST00000381180.7, ENST00000381184.5, ENST00000381187.7,
>>> ENST00000381192.7, ENST00000381218.7, ENST00000381222.6,
>>>    ENST00000381223.8, ENST00000381229.8, ENST00000381233.7,
>>> ENST00000381241.7, ENST00000381261.7, ENST00000381297.8,
>>>    ENST00000381317.7, ENST00000381333.8, ENST00000381401.9,
>>> ENST00000381469.6, ENST00000381500.5, ENST00000381509.7,
>>>    ENST00000381524.7, ENST00000381529.7, ENST00000381566.5,
>>> ENST00000381567.7, ENST00000381575.5, ENST00000381578.5,
>>>    ENST00000381657.6, ENST00000381663.7, ENST00000390665.7,
>>> ENST00000391707.6, ENST00000399012.5, ENST00000399966.8,
>>>    ENST00000400841.6, ENST00000411342.5, ENST00000412936
>>>
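The error above says that parts of the same transcript sit on different seqnames. A quick way to see which transcript IDs are affected is to count distinct seqnames per ID. The sketch below does this in base R on an invented feature table. With the real GRanges one would presumably feed it `as.character(seqnames(gr))` and a `transcript_id` metadata column, which is an assumption about how the GFF was imported.

```r
# Toy feature table (IDs and seqnames are hypothetical).
features <- data.frame(
  seqnames      = c("chrX", "chrX", "chrY", "chr1", "chr1"),
  type          = rep("exon", 5),
  transcript_id = c("txP", "txP", "txP", "txQ", "txQ"),
  stringsAsFactors = FALSE
)

# Number of distinct seqnames each transcript ID appears on.
n_seq <- vapply(split(features$seqnames, features$transcript_id),
                function(x) length(unique(x)), integer(1))

# Transcript IDs that span more than one seqname.
multi_seq_tx <- names(n_seq)[n_seq > 1L]

# One possible workaround (not necessarily the right one): drop them.
clean <- features[!features$transcript_id %in% multi_seq_tx, ]
print(multi_seq_tx)
print(clean)
```

Dropping the affected transcripts loses data, so this is only a diagnostic starting point, not a recommended fix.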
>>>
>>> on top of this, even if it worked, these annotations are anchored at
>>> Ensembl Gene identifiers while the gene-centric annotations at
>>> org.Hs.eg.db are anchored at Entrez Gene identifiers. this means that
>>> more code would have to be involved to add the corresponding Entrez IDs
>>> (resolving multiplicities, etc.) and produce a TxDb package that can be
>>> used across many of the typical BioC pipelines.
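To make the "resolving multiplicities" step concrete, here is a minimal base-R sketch with an invented mapping table. The IDs are hypothetical, and the policy of keeping only one-to-one matches is an assumption chosen for illustration. With org.Hs.eg.db one would more likely call `AnnotationDbi::mapIds()` with `keytype="ENSEMBL"`, `column="ENTREZID"` and a suitable `multiVals` setting.

```r
# Gencode gene IDs carry a version suffix that has to be removed first.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

gencode_gene_ids <- c("ENSG00000000001.5", "ENSG00000000002.12",
                      "ENSG00000000003.1")

# Toy Ensembl -> Entrez table; the second gene maps to two Entrez IDs.
ens2entrez <- data.frame(
  ensembl = c("ENSG00000000001", "ENSG00000000002", "ENSG00000000002"),
  entrez  = c("11", "22", "23"),
  stringsAsFactors = FALSE
)

keys <- strip_version(gencode_gene_ids)

# Count matches per key, then keep only unambiguous one-to-one mappings.
n_hits <- vapply(keys, function(k) sum(ens2entrez$ensembl == k), integer(1))
entrez <- ens2entrez$entrez[match(keys, ens2entrez$ensembl)]
entrez[n_hits != 1L] <- NA_character_
print(data.frame(ensembl = keys, entrez = entrez))
```

Other policies (first match, or keeping all matches as a list) are equally defensible; the point is only that some explicit rule is needed.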
>>>
>>> since human gene annotations are at the core of many BioC pipelines, i'd
>>> like to suggest, for the forthcoming release cycles, that the BioC core
>>> team packages Gencode annotations anchored at Entrez IDs, at least what
>>> is called the "basic set", similarly to what is done with
>>> TxDb.Hsapiens.UCSC.knownGene, to have an easy starting point for the
>>> analysis of human data.
>>>
>>>
>>> cheers,
>>>
>>> robert.
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>