[Bioc-devel] faster gene id conversion?

Sat Nov 22 12:32:09 CET 2014

On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.stamm at gmail.com> wrote:

> Question regarding gene name conversions. Once upon a time, I was doing a
> lot of gene name conversions, particularly from NM_#### to HGNC symbol or
> Entrez GeneID. I used biomaRt successfully, and developed a cache matrix so
> I could quickly merge() it instead of calling out to a webservice
> repeatedly. Later the complexity of keeping the cache updated became
> overwhelming, and carrying around a few megabytes of possibly outdated
> identifiers is a bad idea. Per Bioconductor guidelines, I switched to the
> built-in annotation packages. Now I'm using org.Hs.eg.db's lookup lists
> org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.
>
> These sometimes map to multiple values and sometimes map to nothing,
> causing errors in my code. To clean it up, I wrapped their accessors with
> some error checking. Things work again, assigning one human readable name
> per transcript ID#. Problem is this method is very slow. I thought it could
> be the error checking code, but even trying to streamline that doesn't
> help. A profiler showed that most of my time was spent in .Call; it turns
> out that each access to the "list", like org.Hs.egSYMBOL[[eg]][1], runs
> an sqlite query. Since I am nesting these calls in a loop (NM to EG to
> HGNC, a few thousand times), these copious calls out to sqlite are
> killing me.
>

Hi, Karl.

It is a little hard to diagnose problems without code, but here is a little
code to get a sense of how I might accomplish the task you are describing.
I include timing information.  If this isn't a representative workflow,
perhaps you can show us some code and timing information.

Sean

> library(org.Hs.eg.db)
> # Get all human RefSeq accessions
> refseqs = keys(org.Hs.eg.db,keytype="REFSEQ")
> # Time the lookup for symbol and entrez ID
> system.time((symbols=select(org.Hs.eg.db,keytype="REFSEQ",
+                             keys=refseqs,
+                             columns=c('REFSEQ','SYMBOL','ENTREZID'))))
   user  system elapsed
  2.170   0.071   2.259
> head(symbols)
        REFSEQ SYMBOL ENTREZID
1    NM_130786   A1BG        1
2    NP_570602   A1BG        1
3    NM_000014    A2M        2
4    NP_000005    A2M        2
5 XM_006719056    A2M        2
6 XP_006719119    A2M        2
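
The "multiple values" and "nothing" cases can be handled once, after a single batch lookup, rather than inside a loop. Below is a minimal base-R sketch, assuming a data frame shaped like the select() result. The toy rows (the second symbol "FAKE2" and the unmapped accession "NM_FAKE01") and the helper name first_match_per_key are made up for illustration and are not part of any package.

```r
# Toy stand-in for the data frame returned by select(); in real use this
# would be the result of select() on org.Hs.eg.db.
symbols <- data.frame(
  REFSEQ   = c("NM_130786", "NM_000014", "NM_000014", "NM_FAKE01"),
  SYMBOL   = c("A1BG", "A2M", "FAKE2", NA),
  ENTREZID = c("1", "2", "2", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the first row seen for each key, so every key
# yields exactly one row (unmapped keys keep their NA values).
first_match_per_key <- function(df, key = "REFSEQ") {
  df[!duplicated(df[[key]]), , drop = FALSE]
}

one_per_key <- first_match_per_key(symbols)

# Named character vector: an in-memory lookup with no database access.
refseq2symbol <- setNames(one_per_key$SYMBOL, one_per_key$REFSEQ)
refseq2symbol[c("NM_000014", "NM_FAKE01")]
```

Taking the first match is only one possible policy. If ambiguous accessions should be dropped or flagged instead, the same data frame can be filtered on duplicated() before the named vector is built.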

> I need a way to batch query, or preload to memory these lookup tables. I
> tried using a hash, but checking if a value is already loaded into the
> hash-cache is equally time consuming; and preloading the whole of
> org.Hs.eg.db takes a few hours. I could do it once, and cache the .RData
> object, but we're back to the local-outdated cache problem.
>
> So I think the only solution would be to access the sqlite database
> underlying org.Hs.eg.db myself, so I can use a batch query. Except that
> the db is hidden under the R API of AnnDbBimap objects like
> org.Hs.egSYMBOL.
>
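
One way to preload whole lookup tables without writing SQL is to dump each Bimap into a data frame in a single call and merge in memory. The sketch below assumes that AnnotationDbi and org.Hs.eg.db are installed and that both tables share a "gene_id" column, which is worth confirming with colnames(). The function name build_refseq_lookup is hypothetical.

```r
# Sketch only: needs Bioconductor's AnnotationDbi and org.Hs.eg.db when
# the function is called. build_refseq_lookup() is a hypothetical name.
build_refseq_lookup <- function() {
  # toTable() flattens an entire Bimap into a data frame at once,
  # instead of issuing one query per key.
  sym <- AnnotationDbi::toTable(org.Hs.eg.db::org.Hs.egSYMBOL)
  ref <- AnnotationDbi::toTable(org.Hs.eg.db::org.Hs.egREFSEQ2EG)
  # Assumption: both tables carry a "gene_id" column to join on.
  stopifnot("gene_id" %in% names(sym), "gene_id" %in% names(ref))
  # Keep every RefSeq row, even when no symbol is available.
  merge(ref, sym, by = "gene_id", all.x = TRUE)
}

# Usage (not run here):
# lookup <- build_refseq_lookup()
# head(lookup)
```

Because the table is rebuilt from the installed annotation package each session, nothing has to be cached on disk. For raw SQL, org.Hs.eg_dbconn() returns the DBI connection behind the package, and DBI::dbListTables() on that connection lists the available tables; no table names are assumed here.
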
> I assume this problem has been handled before, and ask for your guidance.
>
> Thanks
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
