[Bioc-devel] biomaRt and TxDb don't play nice together
Pages, Herve
hp@ge@ @end|ng |rom |redhutch@org
Thu Oct 3 20:14:29 CEST 2019
Hi Michael,
tx_by_gene <- transcriptsBy(txdb, by='gene') returns a named list (or
more precisely, a named GRangesList object) where the names are the gene
ids. The show() method for these objects actually gives a clue that the
list carries names:
> tx_by_gene
GRangesList object of length 24467:
$`100009600`
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr9 21062393-21067093 - | 74163 ENSMUST00000115494.2
[2] chr9 21062400-21073096 - | 74165 ENSMUST00000213826.1
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
$`100009609`
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr7 84935565-84964115 - | 60602 ENSMUST00000044583.9
[2] chr7 84935565-84964115 - | 60603 ENSMUST00000232823.1
[3] chr7 84935565-84964115 - | 60604 ENSMUST00000233024.1
[4] chr7 84935565-84964115 - | 60605 ENSMUST00000233469.1
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
$`100009614`
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr10 77711457-77712009 + | 79323 ENSMUST00000075421.6
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
...
<24464 more elements>
Note the difference with:
> unname(tx_by_gene)
GRangesList object of length 24467:
[[1]]
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr9 21062393-21067093 - | 74163 ENSMUST00000115494.2
[2] chr9 21062400-21073096 - | 74165 ENSMUST00000213826.1
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
[[2]]
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr7 84935565-84964115 - | 60602 ENSMUST00000044583.9
[2] chr7 84935565-84964115 - | 60603 ENSMUST00000232823.1
[3] chr7 84935565-84964115 - | 60604 ENSMUST00000233024.1
[4] chr7 84935565-84964115 - | 60605 ENSMUST00000233469.1
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
[[3]]
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr10 77711457-77712009 + | 79323 ENSMUST00000075421.6
-------
seqinfo: 66 sequences (1 circular) from mm10 genome
...
<24464 more elements>
This is similar to what happens with a named list vs unnamed list in base R:
> list(`44`="A", `5`="B")
$`44`
[1] "A"
$`5`
[1] "B"
> unname(list(`44`="A", `5`="B"))
[[1]]
[1] "A"
[[2]]
[1] "B"
Like with any other named list in R, you should always use a subscript
of type character when you lookup some particular genes with
tx_by_gene[my_favorite_genes] or tx_by_gene[[my_favorite_gene]]. As
you've learned, using a numeric subscript will get you the wrong answer
because then the subsetting is positional.
I don't know what type the Ensembl people use to store the
ensemblgene_id in their db but note that the UCSC people use a character
type:
http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=765682895_SNRJX5rTEkoCtJclt55UGpncPIka&hgta_doSchemaDb=mm10&hgta_doSchemaTable=knownToLocusLink
Also TxDb objects always use a character type to store external gene ids
like Ensembl ids.
H.
On 10/3/19 03:45, Michael Shapiro wrote:
>
> Apologies for a previous email that seems content free.
>
> I've run into a cosmic mis-match between biomaRt and TxDb which is either a bug or a bug waiting to happen. In brief, biomaRt reports entrezgene_id as a numeric, but TxDb wants it as a character. What's deadly in this is that TxDb doesn't fail from being supplied with the numeric, it simply accesses the wrong gene. Here is a minimal example where I am trying to get from gene name (Kcnj12) to gene location:
>
> ## Resolve the gene name:
> ensembl = useMart('ensembl', dataset='mmusculus_gene_ensembl')
> geneNames=getBM(c('entrezgene_id', 'external_gene_name'), mart= ensembl)
> idx = geneNames$external_gene_name == 'Kcnj12'
> entrezGeneId = geneNames$entrezgene_id[idx]
>
> ## Get gene locations:
> txdb = TxDb.Mmusculus.UCSC.mm10.knownGene
> tbg = transcriptsBy(txdb,by='gene')
>
> ## Shoot self in foot:
> WRONG_LOCATION = tbg[[entrezGeneId]]
>
> ## Get email from biologist pointing out you've got the wrong gene:
> ACTUAL_LOCATION = tbg[[as.character(entrezGeneId)]]
>
> I would argue that if entrezgene_id is used in some places as a numeric and others as a character, it's safer if biomaRt returns it as a character. If your code is wrong, you want it to fail, not quietly mis-perform. A vector or list will always let you access it using a numeric even when this is wrong. You will probably get an error if you try to access something with a character when you should be using a numeric.
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=xxmO-W0aZN8W5TRjjzH4Ypv7FA6B4TkKXJaIJJ1_TVs&s=Ao9Zb1iHPmFD8QMfPu9sZIDSj-oYy5_MqLo7VRUP2ao&e=
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list