[Bioc-devel] faster gene id conversion?

Karl Stamm karl.stamm at gmail.com
Sat Nov 22 06:53:24 CET 2014


Question regarding gene name conversions. Once upon a time, I was doing a
lot of gene name conversions, particularly from NM_#### to HGNC symbol or
Entrez GeneID. I used biomaRt successfully, and developed a cache matrix so
I could quickly merge() it instead of calling out to a webservice
repeatedly. Later the complexity of keeping the cache updated became
overwhelming, and carrying around a few megabytes of possibly outdated
identifiers is a bad idea. Per Bioconductor guidelines, I switched to the
built-in annotation packages. Now I'm using org.Hs.eg.db's lookup lists
org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.

These sometimes map to multiple values and sometimes map to nothing,
causing errors in my code. To clean it up, I wrapped their accessors with
some error checking. Things work again, assigning one human-readable name
per transcript ID. The problem is that this method is very slow. I thought
it could be the error-checking code, but even streamlining that doesn't
help. A profiler showed that most of my time was spent in .Call. It turns
out that each access to the "list", like org.Hs.egSYMBOL[[eg]][1], runs a
SQLite query. Since I am nesting these calls in a loop (NM to EG to HGNC,
a few thousand times), these copious calls out to SQLite are killing me.

I need a way to batch query these lookup tables, or to preload them into
memory. I tried using a hash, but checking whether a value is already
loaded into the hash cache is just as time-consuming, and preloading the
whole of org.Hs.eg.db takes a few hours. I could do it once and cache the
.RData object, but then we're back to the outdated local cache problem.
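
To make concrete what I mean by a batch lookup, here is a minimal sketch in
base R. It uses two tiny made-up tables in place of org.Hs.egREFSEQ2EG and
org.Hs.egSYMBOL (all IDs and the helper name batch_nm_to_symbol are invented
for illustration), so it only shows the behaviour I am after, not how to get
it out of org.Hs.eg.db.

```r
# Toy stand-ins for the real annotation tables; every ID below is invented.
refseq2eg <- data.frame(
  refseq = c("NM_toy1", "NM_toy2", "NM_toy2", "NM_toy3"),
  entrez = c("101", "102", "103", "104"),
  stringsAsFactors = FALSE
)
eg2symbol <- data.frame(
  entrez = c("101", "102", "103"),
  symbol = c("GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map a whole vector of transcript IDs in one pass.
# match() returns the first hit per key and NA when there is none, so each
# input ID gets exactly one value (or NA) without an explicit loop.
batch_nm_to_symbol <- function(ids) {
  eg <- refseq2eg$entrez[match(ids, refseq2eg$refseq)]
  eg2symbol$symbol[match(eg, eg2symbol$entrez)]
}

ids <- c("NM_toy2", "NM_toy9", "NM_toy1", "NM_toy3")
result <- batch_nm_to_symbol(ids)
```

Here NM_toy2 maps to two Entrez IDs and only the first is used, NM_toy9 maps
to nothing, and NM_toy3 maps to an Entrez ID that has no symbol. The last two
come back as NA instead of raising an error.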

So I think the only solution would be to access the SQLite database
underlying org.Hs.eg.db myself, so I can run a batch query. However, that
database is hidden behind the R API of AnnDbBimap objects such as
org.Hs.egSYMBOL.

I assume this problem has been handled before, and ask for your guidance.

Thanks
