[Bioc-devel] biomaRt and TxDb don't play nice together

James W. MacDonald jm@cdon @end|ng |rom uw@edu
Thu Oct 3 20:18:12 CEST 2019


The bioc-devel list is intended for questions pertaining to package
development, not questions/remarks about existing packages. For that sort
of thing, please use the support site, https://support.bioconductor.org.

To your point, a bug is something that happens that wasn't intended by the
developer. The developers of the TxDb infrastructure (and pretty much all
of the annotation packages) intend for all identifiers to be character. On
the other hand, biomaRt, which is a contributed package that queries
and returns data from an online database, intends for the Gene IDs to be
numeric, as that is what is returned by that database. It's not a bug for
one package to do one thing and another to do another thing! Different
people do different things when they develop packages. There is no way to
guarantee that all of the ~1700 packages in Bioconductor are set up such
that whatever results one package returns will be seamlessly useful as
input to another, and you shouldn't assume that they are.
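
As a rough illustration of why the numeric lookup quietly returns a different
gene, here is a minimal sketch in base R. It assumes a plain named list is a
fair stand-in for the object returned by transcriptsBy(); the IDs, gene names,
and the gene_names data frame are made up for the example and are not real
biomaRt output.

```r
# Toy stand-in for the object returned by transcriptsBy(txdb, by = "gene"):
# a named list whose names are gene IDs stored as character strings.
# The IDs and values are made up for illustration.
tbg <- list("2" = "gene two", "3" = "gene three", "1" = "gene one")

# A numeric subscript selects by position, so this is the third element,
# which happens to be named "1".
by_position <- tbg[[3]]

# A character subscript selects by name, so this is the element named "3".
by_name <- tbg[["3"]]

# Hypothetical query result with an integer ID column. Coercing it once,
# right after the query, means later lookups are always by name.
gene_names <- data.frame(
  entrezgene_id = c(2L, 3L, 1L),
  external_gene_name = c("geneTwo", "geneThree", "geneOne"),
  stringsAsFactors = FALSE
)
gene_names$entrezgene_id <- as.character(gene_names$entrezgene_id)
id <- gene_names$entrezgene_id[gene_names$external_gene_name == "geneThree"]
looked_up <- tbg[[id]]
```

Converting the ID column to character immediately after the query is one way
to keep the two conventions from colliding; it is a suggestion for user code,
not something either package does for you.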

On Thu, Oct 3, 2019 at 6:46 AM Michael Shapiro <sifka using earthlink.net> wrote:

>
> Apologies for a previous email that seems content-free.
>
> I've run into a cosmic mismatch between biomaRt and TxDb which is either
> a bug or a bug waiting to happen.  In brief, biomaRt reports entrezgene_id
> as a numeric, but TxDb wants it as a character.  What's deadly in this is
> that TxDb doesn't fail from being supplied with the numeric, it simply
> accesses the wrong gene.  Here is a minimal example where I am trying to
> get from gene name (Kcnj12) to gene location:
>
>   ## Load the packages used below:
>   library(biomaRt)
>   library(TxDb.Mmusculus.UCSC.mm10.knownGene)
>
>   ## Resolve the gene name:
>   ensembl = useMart('ensembl', dataset = 'mmusculus_gene_ensembl')
>   geneNames = getBM(c('entrezgene_id', 'external_gene_name'), mart = ensembl)
>   idx = geneNames$external_gene_name == 'Kcnj12'
>   entrezGeneId = geneNames$entrezgene_id[idx]
>
>   ## Get gene locations:
>   txdb = TxDb.Mmusculus.UCSC.mm10.knownGene
>   tbg = transcriptsBy(txdb, by = 'gene')
>
>   ## Shoot self in foot:
>   WRONG_LOCATION = tbg[[entrezGeneId]]
>
>   ## Get email from biologist pointing out you've got the wrong gene:
>   ACTUAL_LOCATION = tbg[[as.character(entrezGeneId)]]
>
> I would argue that if entrezgene_id is used in some places as a numeric
> and others as a character, it's safer if biomaRt returns it as a
> character.  If your code is wrong, you want it to fail, not quietly
> misbehave.  A vector or list will let you index it by position with a
> numeric even when a lookup by name was intended.  You will probably get an
> error if you try to access something with a character when you should be
> using a numeric.


-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
