[Bioc-devel] faster gene id conversion?
Sean Davis
seandavi at gmail.com
Sat Nov 22 12:32:09 CET 2014
On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.stamm at gmail.com> wrote:
> Question regarding gene name conversions. Once upon a time, I was doing a
> lot of gene name conversions, particularly from NM_#### to HGNC symbol or
> Entrez GeneID. I used biomaRt successfully, and developed a cache matrix so
> I could quickly merge() it instead of calling out to a webservice
> repeatedly. Later the complexity of keeping the cache updated became
> overwhelming, and carrying around a few megabytes of possibly outdated
> identifiers is a bad idea. Per Bioconductor guidelines, I switched to the
> built in annotation packages. Now I'm using org.Hs.eg.db's lookup lists
> org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.
>
> These sometimes map to multiple values and sometimes map to nothing,
> causing errors in my code. To clean it up, I wrapped their accessors with
> some error checking. Things work again, assigning one human readable name
> per transcript ID#. Problem is this method is very slow. I thought it could
> be the error checking code, but even trying to streamline that doesn't
> help. A profiler showed that most of my time was spent in .Call, actually
> it turns out each access to the "list" like this org.Hs.egSYMBOL[[eg]][1]
> was calling a sqlite query. Since I am nesting these calls in a loop, (NM
> to EG to HGNC, a few thousands of times), these copious calls out to sqlite
> are killing me.
>
Hi, Karl.
It is a little hard to diagnose problems without code, but here is a little
code to get a sense of how I might accomplish the task you are describing.
I include timing information. If this isn't a representative workflow,
perhaps you can show us some code and timing information.
Sean
> # Load the annotation package (this also attaches AnnotationDbi)
> library(org.Hs.eg.db)
> # Get all human refseq accessions
> refseqs = keys(org.Hs.eg.db, keytype="REFSEQ")
> # Time the lookup for symbol and entrez ID
> system.time((symbols = select(org.Hs.eg.db, keytype="REFSEQ",
+                               keys=refseqs,
+                               columns=c('REFSEQ','SYMBOL','ENTREZID'))))
   user  system elapsed
  2.170   0.071   2.259
> head(symbols)
        REFSEQ SYMBOL ENTREZID
1    NM_130786   A1BG        1
2    NP_570602   A1BG        1
3    NM_000014    A2M        2
4    NP_000005    A2M        2
5 XM_006719056    A2M        2
6 XP_006719119    A2M        2
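If the goal is exactly one symbol per transcript ID, a table like the one above can be collapsed after the batch lookup rather than inside a loop. The following is only a sketch: `one_per_key` is a hypothetical helper name (not part of AnnotationDbi), "keep the first match" is an assumed policy, and the last ID in the example is deliberately made up to show how an unmatched key behaves.

```r
# Keep the first row for each key in a select()-style data frame.
# `one_per_key` is a hypothetical helper, not an AnnotationDbi function.
one_per_key <- function(tbl, key = "REFSEQ") {
  tbl[!duplicated(tbl[[key]]), , drop = FALSE]
}

if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)
  # The third ID is intentionally invalid.
  ids <- c("NM_130786", "NM_000014", "NM_not_a_real_id")
  # mapIds() does one batch lookup and returns a named vector;
  # multiVals = "first" keeps a single value per key, and keys
  # without a match come back as NA instead of raising an error
  # (as long as at least one key is valid).
  sym <- mapIds(org.Hs.eg.db, keys = ids, column = "SYMBOL",
                keytype = "REFSEQ", multiVals = "first")
  print(sym)
}
```

Either route handles both the "multiple values" and the "nothing" cases in one vectorized step, so no per-ID error checking is needed.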
> I need a way to batch query, or preload to memory these lookup tables. I
> tried using a hash, but checking if a value is already loaded into the
> hash-cache is equally time consuming; and preloading the whole of
> org.Hs.eg.db takes a few hours. I could do it once, and cache the .RData
> object, but we're back to the local-outdated cache problem.
>
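On preloading: the Bimap objects can be dumped to a data frame in one call with toTable(), which should avoid issuing one query per element. Below is a rough sketch of building an in-memory lookup that way. `build_lookup` is a hypothetical helper, and the column names `gene_id`, `accession` and `symbol` are what I would expect toTable() to return for these two maps; please check with colnames() on your installed version.

```r
# Build a named character vector: names are RefSeq accessions,
# values are symbols. `build_lookup` is a hypothetical helper.
# Assumes columns gene_id/accession and gene_id/symbol.
build_lookup <- function(refseq2eg, eg2symbol) {
  m <- merge(refseq2eg, eg2symbol, by = "gene_id", all.x = TRUE)
  m <- m[!duplicated(m$accession), , drop = FALSE]  # one symbol per accession
  setNames(m$symbol, m$accession)
}

if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)
  # Each toTable() call pulls the whole map into memory at once.
  lookup <- build_lookup(toTable(org.Hs.egREFSEQ2EG),
                         toTable(org.Hs.egSYMBOL))
  # Vectorized lookup; unknown accessions yield NA.
  print(unname(lookup[c("NM_130786", "NM_000014")]))
}
```

Because the table is rebuilt from the installed annotation package each session, there is no separate .RData cache to go stale.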
> So I think the only solution would be to access the sqlite underlying the
> org.Hs.eg.db myself, so I can use the batch query. Except that db is hidden
> under the R/API of these Anno-BiMap objects like org.Hs.egSYMBOL.
>
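The underlying SQLite database is reachable if you really want it: the package exports org.Hs.eg_dbconn(), which returns a DBI connection. The sketch below is untested against your version, and the table and column names (`refseq`, `genes`, `gene_info`, `_id`, `accession`, `gene_id`, `symbol`) are my assumption about the schema; DBI::dbListTables() and DBI::dbListFields() will show what is actually there. `fetch_refseq_map` is a hypothetical helper name.

```r
# Fetch accession -> Entrez ID -> symbol in a single SQL query.
# `fetch_refseq_map` is a hypothetical helper; the schema used in
# the query is an assumption to verify with DBI::dbListTables().
fetch_refseq_map <- function(con) {
  sql <- paste(
    "SELECT r.accession, g.gene_id, i.symbol",
    "FROM refseq AS r",
    "JOIN genes AS g ON r._id = g._id",
    "JOIN gene_info AS i ON i._id = g._id"
  )
  DBI::dbGetQuery(con, sql)
}

if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)
  con <- org.Hs.eg_dbconn()  # managed by the package; do not disconnect it
  print(head(fetch_refseq_map(con)))
}
```

That said, select() already runs a batch query for you, so going to SQL directly is probably only worth it for joins the select() interface cannot express.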
> I assume this problem has been handled before, and ask for your guidance.
>
> Thanks
>