[Bioc-devel] faster gene id conversion?

Karl Stamm karl.stamm at gmail.com
Mon Nov 24 18:02:39 CET 2014


Thanks everyone! I wasn't familiar with the select() function, it appears
to be specific to the AnnotationDb package, and not referenced by the
examples of org.Hs.eg.db, where I was looking. Those examples for things
like org.Hs.eg.dbREFSEQ call the list accessor or a vector accessor, and
always for the full list.
mapped_seqs <- mappedkeys(x)
xx <- as.list(x[mapped_seqs])
xx[[1]]

Following that pattern, I developed a big slow DisplayName() with tryCatch
for null values, and a double lookup to go from Refseq to EG to HGNC.

I've reimplemented following your advice (Thanks Sean Davis and Vincent
Carey). Examples with benchmarking are posted here:
http://pastebin.com/HjNbtD1g

Calling select() with a short vector of refseq IDs takes a similar amount
of time, between a half second and a full second for all three methods.
It's not an improvement. As my use-case is lots of calls for a few IDs,
this is still a problem.

However, my method looped the input ids and called one at a time, so
runtime is linear, whereas the select() methods are nearly constant time,
calling for fifty ids is therefore MUCH faster.  Calling the whole
database, as Sean showed, only uses a few seconds. This becomes an
opportunity to pre-cache, as I tried to do, but with the linear algorithm
needed more than an hour at package load time.

A fourth version of the function, using the select() to eat 3 seconds and
fill a hash(), subsequently blows the other methods out of the water. For
example, to pull 50 genes with my slow method was 7 seconds, by
select(org.Hs.eg.db) was 0.8 seconds, by select(Homo.sapiens) was 1.6
seconds, and by hash() is 0.003 seconds. (microbenchmark median of 30
evals).

Thanks to your help I can fill the hash in a reasonable time, and save it
on package load. Next step, I have to learn how to use the package
onLoad().



On Sat, Nov 22, 2014 at 5:32 AM, Sean Davis <seandavi at gmail.com> wrote:

>
>
> On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.stamm at gmail.com> wrote:
>
>> Question regarding gene name conversions. Once upon a time, I was doing a
>> lot of gene name conversions, particularly from NM_#### to HGNC symbol or
>> Entrez GeneID. I used bioMaRt successfully, and developed a cache matrix
>> so
>> I could quickly merge() it instead of calling out to a webservice
>> repeatedly. Later the complexity of keeping the cache updated became
>> overwhelming, and carrying around a few megabytes of possibly outdated
>> identifiers is a bad idea. Per Bioconductor guidelines, I switched to the
>> built in annotation packages. Now I'm using org.Hs.eg.db's lookup lists
>> org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.
>>
>> These sometimes map to multiple values and sometimes map to nothing,
>> causing errors in my code. To clean it up, I wrapped their accessors with
>> some error checking. Things work again, assigning one human readable name
>> per transcript ID#. Problem is this method is very slow. I thought it
>> could
>> be the error checking code, but even trying to streamline that doesn't
>> help. A profiler showed that most of my time was spent in .Call, actually
>> it turns out each access to the "list" like this org.Hs.egSYMBOL[[eg]][1]
>> was calling a sqlite query. Since I am nesting these calls in a loop, (NM
>> to EG to HGNC, a few thousands of times), these copious calls out to
>> sqlite
>> are killing me.
>>
>
> Hi, Karl.
>
> It is a little hard to diagnose problems without code, but here is a
> little code to get a sense of how I might accomplish the task you are
> describing.  I include timing information.  If this isn't a representative
> workflow, perhaps you can show us some code and timing information.
>
> Sean
>
> > # Get all human refseq accessions
> > refseqs = keys(org.Hs.eg.db,keytype="REFSEQ")
> > # Time the lookup for symbol and entrez ID
> > system.time((symbols=select(org.Hs.eg.db,keytype="REFSEQ",
> +                             keys=refseqs,
> +                             columns=c('REFSEQ','SYMBOL','ENTREZID'))))
>    user  system elapsed
>   2.170   0.071   2.259
> > head(symbols)
>         REFSEQ SYMBOL ENTREZID
> 1    NM_130786   A1BG        1
> 2    NP_570602   A1BG        1
> 3    NM_000014    A2M        2
> 4    NP_000005    A2M        2
> 5 XM_006719056    A2M        2
> 6 XP_006719119    A2M        2
>
>
>> I need a way to batch query, or preload to memory these lookup tables. I
>> tried using a hash, but checking if a value is already loaded into the
>> hash-cache is equally time consuming; and preloading the whole of
>> org.Hs.eg.db takes a few hours. I could do it once, and cache the .RData
>> object, but we're back to the local-outdated cache problem.
>>
>> So I think the only solution would be to access the sqlite underlying the
>> org.Hs.eg.db myself, so I can use the batch query. Except that db is
>> hidden
>> under the R/API of these Anno-BiMap objects like org.Hs.egSYMBOL.
>>
>> I assume this problem has been handled before, and ask for your guidance.
>>
>> Thanks
>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list