[Bioc-devel] Known Genes replaced by GENCODE genes at UCSC

Tue Jan 12 08:14:06 CET 2016

Just as an FYI… EnsDb objects/packages (from the ensembldb package) provide functionality similar to that of TxDb objects, are tailored to Ensembl annotations, and can be built from the GTF files from Ensembl (which can be fetched via AnnotationHub; it’s all described in the ensembldb vignette).

cheers, jo

> On 11 Jan 2016, at 21:40, Paul Grosu <pgrosu at gmail.com> wrote:
> 
> 
> Tim, you always crack me up! :)  I totally agree, and it would probably be
> good to also have the tools enabled to download directly from Ensembl, NCBI,
> cloud-annotation source, etc. and build/update the AnnDbBimap objects.  This
> way the annotation sources can maintain the data and we can maintain the
> scripts, along with the pre-built AnnDbBimap objects just in case.
> 
> ~p
> 
> -----Original Message-----
> From: Bioc-devel [mailto:bioc-devel-bounces at r-project.org] On Behalf Of Tim
> Triche, Jr.
> Sent: Monday, January 11, 2016 2:02 PM
> To: Vincent Carey
> Cc: bioc-devel at r-project.org
> Subject: Re: [Bioc-devel] Known Genes replaced by GENCODE genes at UCSC
> 
> ENSEMBL
> 
> knownGene was always a disaster.  For extra amusement/horror, be sure to
> check out the sad saga of the TCGA GAF and its disconnection from knownGenes
> as well as reality.  Three cheers for rendering transcript-level estimates
> useless (and no this was not Katie's fault)
> 
> Rainer and many others have made a herculean effort to bring all the BioC
> annotation infrastructure into the 21st century... having worked with
> Kallisto extensively of late, I see no reason to use a non-ENSEMBL
> "conservative" reference transcriptome (I see plenty of reasons to use
> miTranscriptome, etc. but that is another discussion).
> 
> sorry if slighting anyone/everyone, but ENSEMBL is the clear choice IMHO.
> 
> $0.02 - transmission costs
> 
> 
> --t
> 
> On Mon, Jan 11, 2016 at 10:57 AM, Vincent Carey <stvjc at channing.harvard.edu>
> wrote:
> 
>> I think these are all good observations and we may benefit from a 
>> wider discussion on the support site?
>> 
>> the abandonment of knownGene seems to have clear implications for 
>> changing our most visible txdb examples.  what should we change to?  
>> can we make a more future-proof design for these annotation 
>> selections?
>> 
>> On Mon, Jan 11, 2016 at 1:40 PM, Robert Castelo 
>> <robert.castelo at upf.edu>
>> wrote:
>> 
>>> hi,
>>> 
>>> On 01/11/2016 04:07 PM, Vincent Carey wrote:
>>> [...]
>>> 
>>>> Is it true that there is an asymmetry between Entrez gene ID and 
>>>> Ensembl gene ID for querying org.Hs.eg.db (I tend to prefer 
>>>> Homo.sapiens as a symbol mapping resource)?  Both ENTREZID and 
>>>> ENSEMBL are listed as keytypes.  My question is whether this 
>>>> "anchor" concept holds in the current infrastructure.
>>>> 
>>> 
>>> you're right that the infrastructure is probably symmetric at least 
>>> between Entrez and Ensembl, so maybe i'm not using the term "anchor"
>>> correctly here. i'm just referring to the fact that many package
>>> functions and use cases of BioC are based on, or illustrated using,
>>> Entrez IDs. examples are:
>>> 
>>> head(AnnotationDbi::keys(org.Hs.eg.db))
>>> [1] "1"  "2"  "3"  "9"  "10" "11"
>>> 
>>> i.e., by default the 'keytype' is 'ENTREZID'
>>> 
>>> genefilter::nsFilter() argument 'require.entrez' filters out 
>>> features without an Entrez Gene ID annotation.
>>> 
>>> Category::categoryToEntrezBuilder() returns a list mapping category 
>>> ids to the Entrez Gene ids annotated at the category id.
>>> 
>>> SummarizedExperiment::geneRangeMapper() takes a 'TxDb' object and a 
>>> keytype to map ranges to genes. By default the keytype is 'ENTREZID'
>>> 
>>> some of the workflows are also based on Entrez IDs, such as:
>>> 
>>> http://www.bioconductor.org/help/workflows/annotation/Annotation_Resources
>>> 
>>> http://www.bioconductor.org/help/workflows/variants
>>> 
>>> so if the user just replaces the txdb object in one of those 
>>> examples or function arguments with a txdb object that does not have 
>>> Entrez identifiers as primary gene key, those functions, examples or 
>>> workflows will require modification. this is not necessarily bad, 
>>> but may put more burden on the user who is learning with a "default"
>>> TxDb human gene annotation package. this has been so far the
>>> *.UCSC.knownGene using Entrez as gene identifiers.
>>> given the apparent discontinuation of the known gene track at UCSC, 
>>> i would suggest making available at the BioC site another default gene 
>>> annotation package, but one based on Entrez identifiers, given 
>>> the amount of legacy code and documentation using Entrez in one way or
>>> another.
>>> 
>>> an alternative to translating the default Ensembl Gencode 
>>> identifiers into Entrez would be to just take the NCBI RefSeq
>>> annotations as the human gene annotation package available by
>>> default, i.e., replacing the current *.UCSC.knownGene by
>>> *.UCSC.refGene
>>> 
>>> 
>>> 
>>> robert.
>>> 
>> 
>>        [[alternative HTML version deleted]]
>> 
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel