[Bioc-devel] biomaRt and TxDb don't play nice together

Thu Oct 3 20:14:29 CEST 2019

Hi Michael,

tx_by_gene <- transcriptsBy(txdb, by='gene') returns a named list (or 
more precisely, a named GRangesList object) where the names are the gene 
ids. The show() method for these objects actually gives a clue that the 
list carries names:

 > tx_by_gene
GRangesList object of length 24467:
$`100009600`
GRanges object with 2 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id              tx_name
          <Rle>         <IRanges>  <Rle> | <integer>          <character>
   [1]     chr9 21062393-21067093      - |     74163 ENSMUST00000115494.2
   [2]     chr9 21062400-21073096      - |     74165 ENSMUST00000213826.1
   -------
   seqinfo: 66 sequences (1 circular) from mm10 genome

$`100009609`
GRanges object with 4 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id              tx_name
          <Rle>         <IRanges>  <Rle> | <integer>          <character>
   [1]     chr7 84935565-84964115      - |     60602 ENSMUST00000044583.9
   [2]     chr7 84935565-84964115      - |     60603 ENSMUST00000232823.1
   [3]     chr7 84935565-84964115      - |     60604 ENSMUST00000233024.1
   [4]     chr7 84935565-84964115      - |     60605 ENSMUST00000233469.1
   -------
   seqinfo: 66 sequences (1 circular) from mm10 genome

$`100009614`
GRanges object with 1 range and 2 metadata columns:
       seqnames            ranges strand |     tx_id              tx_name
          <Rle>         <IRanges>  <Rle> | <integer>          <character>
   [1]    chr10 77711457-77712009      + |     79323 ENSMUST00000075421.6
   -------
   seqinfo: 66 sequences (1 circular) from mm10 genome

...
<24464 more elements>

Note the difference with:

 > unname(tx_by_gene)
GRangesList object of length 24467:
[[1]]
GRanges object with 2 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id              tx_name
          <Rle>         <IRanges>  <Rle> | <integer>          <character>
   [1]     chr9 21062393-21067093      - |     74163 ENSMUST00000115494.2
   [2]     chr9 21062400-21073096      - |     74165 ENSMUST00000213826.1
   -------
   seqinfo: 66 sequences (1 circular) from mm10 genome

[[2]]
GRanges object with 4 ranges and 2 metadata columns:
       seqnames            ranges strand |     tx_id              tx_name
          <Rle>         <IRanges>  <Rle> | <integer>          <character>
   [1]     chr7 84935565-84964115      - |     60602 ENSMUST00000044583.9
   [2]     chr7 84935565-84964115      - |     60603 ENSMUST00000232823.1
   [3]     chr7 84935565-84964115      - |     60604 ENSMUST00000233024.1
   [4]     chr7 84935565-84964115      - |     60605 ENSMUST00000233469.1
   -------
   seqinfo: 66 sequences (1 circular) from mm10 genome

[[3]]
GRanges object with 1 range and 2 metadata columns:
       seqnames            ranges strand |     tx_id              tx_name
          <Rle>         <IRanges>  <Rle> | <integer>          <character>
   [1]    chr10 77711457-77712009      + |     79323 ENSMUST00000075421.6
   -------
   seqinfo: 66 sequences (1 circular) from mm10 genome

...
<24464 more elements>

This is similar to what happens with a named list vs unnamed list in base R:

 > list(`44`="A", `5`="B")
$`44`
[1] "A"

$`5`
[1] "B"

 > unname(list(`44`="A", `5`="B"))
[[1]]
[1] "A"

[[2]]
[1] "B"

Like with any other named list in R, you should always use a subscript 
of type character when you lookup some particular genes with 
tx_by_gene[my_favorite_genes] or tx_by_gene[[my_favorite_gene]]. As 
you've learned, using a numeric subscript will get you the wrong answer 
because then the subsetting is positional.

I don't know what type the Ensembl people use to store the 
ensemblgene_id in their db but note that the UCSC people use a character 
type:

http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=765682895_SNRJX5rTEkoCtJclt55UGpncPIka&hgta_doSchemaDb=mm10&hgta_doSchemaTable=knownToLocusLink

Also TxDb objects always use a character type to store external gene ids 
like Ensembl ids.

H.

On 10/3/19 03:45, Michael Shapiro wrote:
> 
> Apologies for a previous email that seems content free.
> 
> I've run into a cosmic mis-match between biomaRt and TxDb which is either a bug or a bug waiting to happen.  In brief, biomaRt reports entrezgene_id as a numeric, but TxDb wants it as a character.  What's deadly in this is that TxDb doesn't fail from being supplied with the numeric, it simply accesses the wrong gene.  Here is a minimal example where I am trying to get from gene name (Kcnj12) to gene location:
> 
>    ## Resolve the gene name:
>    ensembl = useMart('ensembl', dataset='mmusculus_gene_ensembl')
>    geneNames=getBM(c('entrezgene_id', 'external_gene_name'), mart= ensembl)
>    idx = geneNames$external_gene_name == 'Kcnj12'
>    entrezGeneId = geneNames$entrezgene_id[idx]
> 
>    ## Get gene locations:
>    txdb = TxDb.Mmusculus.UCSC.mm10.knownGene
>    tbg =  transcriptsBy(txdb,by='gene')
> 
>    ## Shoot self in foot:
>    WRONG_LOCATION = tbg[[entrezGeneId]]
> 
>    ## Get email from biologist pointing out you've got the wrong gene:
>    ACTUAL_LOCATION = tbg[[as.character(entrezGeneId)]]
> 
> I would argue that if entrezgene_id is used in some places as a numeric and others as a character, it's safer if biomaRt returns it as a character.  If your code is wrong, you want it to fail, not quietly mis-perform.  A vector or list will always let you access it using a numeric even when this is wrong.  You will probably get an error if you try to access something with a character when you should be using a numeric.
> 
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=xxmO-W0aZN8W5TRjjzH4Ypv7FA6B4TkKXJaIJJ1_TVs&s=Ao9Zb1iHPmFD8QMfPu9sZIDSj-oYy5_MqLo7VRUP2ao&e=
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319