[Bioc-devel] faster gene id conversion?

Vincent Carey stvjc at channing.harvard.edu
Sat Nov 22 12:17:44 CET 2014


I am not sure this is a devel question.  Posting it to the support site would
probably lead to more insights from users and developers with different
approaches.

It seems to me you may not have worked with the OrganismDb framework:

library(Homo.sapiens)

select(Homo.sapiens, keys=c("NM_001794"), keytype="REFSEQ",
       columns=c("ENTREZID", "SYMBOL"))

     REFSEQ ENTREZID SYMBOL
1 NM_001794     1002   CDH4
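
Since keys= takes a character vector, the whole list of IDs can go through a
single select() call instead of a loop. The sketch below is only an
illustration under some assumptions: first_per_key() is a hypothetical helper
(it is not part of AnnotationDbi) that keeps the first row for each key, which
is one simple way to get one name per transcript, and the toy table uses
made-up values that merely mimic the shape of select() output. The real query
is guarded so it runs only when Homo.sapiens is installed.

```r
# Hypothetical helper (not part of AnnotationDbi): keep the first row per key
# from a select()-style data frame, so each input ID gets exactly one mapping.
first_per_key <- function(tbl, key_col) {
  tbl[!duplicated(tbl[[key_col]]), , drop = FALSE]
}

# Toy table shaped like select() output. The values are made up to show a
# 1:many mapping (NM_A) and a key with no mapping (NM_C, reported as NA).
toy <- data.frame(
  REFSEQ   = c("NM_A", "NM_A", "NM_B", "NM_C"),
  ENTREZID = c("1", "2", "3", NA),
  SYMBOL   = c("G1", "G2", "G3", NA),
  stringsAsFactors = FALSE
)
one_per_id <- first_per_key(toy, "REFSEQ")

# The real lookup, run only when the annotation package is installed.
if (requireNamespace("Homo.sapiens", quietly = TRUE)) {
  library(Homo.sapiens)
  ids <- c("NM_001794")   # in practice, the full vector of RefSeq IDs
  res <- select(Homo.sapiens, keys = ids, keytype = "REFSEQ",
                columns = c("ENTREZID", "SYMBOL"))
  res <- first_per_key(res, "REFSEQ")
}
```

Taking the first match is an arbitrary choice; whether it is acceptable
depends on how the ambiguous IDs should be treated in your analysis.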

If the performance here is inadequate, please give some statistics from
microbenchmark along with reproducible examples, and we will consider how to
improve it.
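
A timing report could look something like the sketch below. It is an
illustration, not a prescribed benchmark: lookup() is a hypothetical wrapper
around the select() call above, and the sample size (20 keys) and repetition
count (5) are arbitrary assumptions chosen to keep the run short. The timing
runs only when both Homo.sapiens and microbenchmark are installed.

```r
# Hypothetical wrapper around the select() call shown earlier in this message.
lookup <- function(ids) {
  select(Homo.sapiens, keys = ids, keytype = "REFSEQ",
         columns = c("ENTREZID", "SYMBOL"))
}

have_pkgs <- requireNamespace("Homo.sapiens", quietly = TRUE) &&
  requireNamespace("microbenchmark", quietly = TRUE)

if (have_pkgs) {
  library(Homo.sapiens)
  # Use a fixed set of real keys so that anyone can rerun the same test.
  ids <- head(keys(Homo.sapiens, keytype = "REFSEQ"), 20)
  timings <- microbenchmark::microbenchmark(
    one_call = lookup(ids),          # a single vectorized query
    per_key  = lapply(ids, lookup),  # one query per ID, as in a loop
    times = 5
  )
  print(timings)
  print(sessionInfo())   # package versions belong in a timing report
}
```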

On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.stamm at gmail.com> wrote:

> Question regarding gene name conversions. Once upon a time, I was doing a
> lot of gene name conversions, particularly from NM_#### to HGNC symbol or
> Entrez GeneID. I used biomaRt successfully, and developed a cache matrix so
> I could quickly merge() it instead of calling out to a webservice
> repeatedly. Later the complexity of keeping the cache updated became
> overwhelming, and carrying around a few megabytes of possibly outdated
> identifiers is a bad idea. Per Bioconductor guidelines, I switched to the
> built-in annotation packages. Now I'm using org.Hs.eg.db's lookup lists
> org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.
>
> These sometimes map to multiple values and sometimes map to nothing,
> causing errors in my code. To clean it up, I wrapped their accessors with
> some error checking. Things work again, assigning one human readable name
> per transcript ID. The problem is that this method is very slow. I thought
> it could be the error checking code, but even streamlining that doesn't
> help. A profiler showed that most of my time was spent in .Call; it turns
> out that each access to the "list", like org.Hs.egSYMBOL[[eg]][1], was
> running a sqlite query. Since I am nesting these calls in a loop (NM to EG
> to HGNC, a few thousand times), these copious calls out to sqlite are
> killing me.
>
> I need a way to batch query, or preload to memory these lookup tables. I
> tried using a hash, but checking if a value is already loaded into the
> hash-cache is equally time consuming; and preloading the whole of
> org.Hs.eg.db takes a few hours. I could do it once, and cache the .RData
> object, but we're back to the local-outdated cache problem.
>
> So I think the only solution would be to access the sqlite underlying the
> org.Hs.eg.db myself, so I can use the batch query. Except that db is hidden
> under the R/API of these Anno-BiMap objects like org.Hs.egSYMBOL.
>
> I assume this problem has been handled before, and ask for your guidance.
>
> Thanks
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
