[Bioc-devel] biomaRt and TxDb don't play nice together

Michael Shapiro @||k@ @end|ng |rom e@rth||nk@net
Thu Oct 3 12:45:17 CEST 2019

Apologies for my previous email, which seems to have arrived content-free.

I've run into a cosmic mismatch between biomaRt and TxDb which is either a bug or a bug waiting to happen.  In brief, biomaRt reports entrezgene_id as a numeric, but the object returned by TxDb's transcriptsBy() is keyed by gene ID as a character.  What's deadly here is that supplying the numeric does not fail: the numeric is treated as a position rather than a name, so it simply accesses the wrong gene.  Here is a minimal example where I am trying to get from gene name (Kcnj12) to gene location:

  library(biomaRt)
  library(TxDb.Mmusculus.UCSC.mm10.knownGene)

  ## Resolve the gene name:
  ensembl = useMart('ensembl', dataset = 'mmusculus_gene_ensembl')
  geneNames = getBM(attributes = c('entrezgene_id', 'external_gene_name'), mart = ensembl)
  idx = geneNames$external_gene_name == 'Kcnj12'
  entrezGeneId = geneNames$entrezgene_id[idx]

  ## Get gene locations (a list of transcripts named by gene ID):
  txdb = TxDb.Mmusculus.UCSC.mm10.knownGene
  tbg = transcriptsBy(txdb, by = 'gene')

  ## Shoot self in foot (numeric index selects by position, not by name):
  WRONG_LOCATION = tbg[[entrezGeneId]]

  ## Get email from biologist pointing out you've got the wrong gene:
  ACTUAL_LOCATION = tbg[[as.character(entrezGeneId)]]
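
The same pitfall can be reproduced without biomaRt or TxDb. The sketch below uses a plain named list as a stand-in for the object returned by transcriptsBy(); the IDs and values are invented for illustration and are not real Entrez gene IDs.

```r
# Toy stand-in: a list whose names are gene IDs stored as character strings.
# The IDs and values here are made up for illustration only.
tbg_toy <- list("11" = "gene with ID 11",
                "22" = "gene with ID 22",
                "2"  = "gene with ID 2")

id <- 2  # numeric, as the ID arrives in the example above

by_position <- tbg_toy[[id]]                # second element of the list
by_name     <- tbg_toy[[as.character(id)]]  # element whose name is "2"

print(by_position)
print(by_name)
```

No error or warning is raised in either case; the two lookups just return different elements.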

I would argue that if entrezgene_id is used as a numeric in some places and as a character in others, it would be safer for biomaRt to return it as a character.  If your code is wrong, you want it to fail, not quietly misbehave.  A vector or list will generally let you index it with a numeric even when that is the wrong thing to do, because the numeric is taken as a position.  By contrast, you will probably get an error if you try to access something with a character when you should be using a numeric.
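
Until the types agree, one way to get the "fail loudly" behaviour on the calling side is a small wrapper that always looks up by name. This is only a sketch: get_by_id is a hypothetical helper, not part of biomaRt or GenomicFeatures, and it is shown on a toy list with invented IDs.

```r
# Hypothetical helper: look up an element strictly by name and stop with an
# error if the ID is not present, instead of falling back to a position.
get_by_id <- function(x, id) {
  stopifnot(length(id) == 1L, !is.na(id))
  # format() with scientific = FALSE avoids keys such as "1e+05", which
  # as.character() can produce for some doubles.
  key <- if (is.numeric(id)) format(id, scientific = FALSE, trim = TRUE) else as.character(id)
  if (!key %in% names(x)) {
    stop("ID not found among names: ", key)
  }
  x[[key]]
}

toy <- list("11" = "a", "22" = "b", "100000" = "c")  # invented IDs

hit    <- get_by_id(toy, 22)                    # matched by name "22"
big    <- get_by_id(toy, 1e5)                   # matched by name "100000"
missed <- try(get_by_id(toy, 2), silent = TRUE) # error, not the 2nd element
```

With this wrapper, a numeric 2 raises an error instead of quietly returning the second element.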
