[Bioc-devel] faster gene id conversion?

Mon Nov 24 18:16:37 CET 2014

On Mon, Nov 24, 2014 at 12:02 PM, Karl Stamm <karl.stamm at gmail.com> wrote:

> Thanks everyone! I wasn't familiar with the select() function, it appears
> to be specific to the AnnotationDb package, and not referenced by the
> examples of org.Hs.eg.db, where I was looking. Those examples for things
> like org.Hs.eg.dbREFSEQ call the list accessor or a vector accessor, and
> always for the full list.
> mapped_seqs <- mappedkeys(x)
> xx <- as.list(x[mapped_seqs])
> xx[[1]]
>
> Following that pattern, I developed a big slow DisplayName() with tryCatch
> for null values, and a double lookup to go from Refseq to EG to HGNC.
>
> I've reimplemented following your advice (Thanks Sean Davis and Vincent
> Carey). Examples with benchmarking are posted here:
> http://pastebin.com/HjNbtD1g
>
> Calling select() with a short vector of refseq IDs takes a similar amount
> of time, between a half second and a full second for all three methods.
> It's not an improvement. As my use-case is lots of calls for a few IDs,
> this is still a problem.
>
> However, my method looped the input ids and called one at a time, so
> runtime is linear, whereas the select() methods are nearly constant time,
> calling for fifty ids is therefore MUCH faster.  Calling the whole
> database, as Sean showed, only uses a few seconds. This becomes an
> opportunity to pre-cache, as I tried to do, but with the linear algorithm
> needed more than an hour at package load time.
>
> A fourth version of the function, using the select() to eat 3 seconds and
> fill a hash(), subsequently blows the other methods out of the water. For
> example, to pull 50 genes with my slow method was 7 seconds, by
> select(org.Hs.eg.db) was 0.8 seconds, by select(Homo.sapiens) was 1.6
> seconds, and by hash() is 0.003 seconds. (microbenchmark median of 30
> evals).
>

Hi, Karl.

You might consider adding a list cache mechanism as well.  The hash() is
based on environments, if I recall.  Last I checked, lists could be faster
for smaller datasets.

Sean

> Thanks to your help I can fill the hash in a reasonable time, and save it
> on package load. Next step, I have to learn how to use the package
> onLoad().
>
>
>
> On Sat, Nov 22, 2014 at 5:32 AM, Sean Davis <seandavi at gmail.com> wrote:
>
> >
> >
> > On Sat, Nov 22, 2014 at 12:53 AM, Karl Stamm <karl.stamm at gmail.com>
> wrote:
> >
> >> Question regarding gene name conversions. Once upon a time, I was doing
> a
> >> lot of gene name conversions, particularly from NM_#### to HGNC symbol
> or
> >> Entrez GeneID. I used bioMaRt successfully, and developed a cache matrix
> >> so
> >> I could quickly merge() it instead of calling out to a webservice
> >> repeatedly. Later the complexity of keeping the cache updated became
> >> overwhelming, and carrying around a few megabytes of possibly outdated
> >> identifiers is a bad idea. Per Bioconductor guidelines, I switched to
> the
> >> built in annotation packages. Now I'm using org.Hs.eg.db's lookup lists
> >> org.Hs.egREFSEQ2EG and org.Hs.egSYMBOL.
> >>
> >> These sometimes map to multiple values and sometimes map to nothing,
> >> causing errors in my code. To clean it up, I wrapped their accessors
> with
> >> some error checking. Things work again, assigning one human readable
> name
> >> per transcript ID#. Problem is this method is very slow. I thought it
> >> could
> >> be the error checking code, but even trying to streamline that doesn't
> >> help. A profiler showed that most of my time was spent in .Call,
> actually
> >> it turns out each access to the "list" like this
> org.Hs.egSYMBOL[[eg]][1]
> >> was calling a sqlite query. Since I am nesting these calls in a loop,
> (NM
> >> to EG to HGNC, a few thousands of times), these copious calls out to
> >> sqlite
> >> are killing me.
> >>
> >
> > Hi, Karl.
> >
> > It is a little hard to diagnose problems without code, but here is a
> > little code to get a sense of how I might accomplish the task you are
> > describing.  I include timing information.  If this isn't a
> representative
> > workflow, perhaps you can show us some code and timing information.
> >
> > Sean
> >
> > > # Get all human refseq accessions
> > > refseqs = keys(org.Hs.eg.db,keytype="REFSEQ")
> > > # Time the lookup for symbol and entrez ID
> > > system.time((symbols=select(org.Hs.eg.db,keytype="REFSEQ",
> > +                             keys=refseqs,
> > +                             columns=c('REFSEQ','SYMBOL','ENTREZID'))))
> >    user  system elapsed
> >   2.170   0.071   2.259
> > > head(symbols)
> >         REFSEQ SYMBOL ENTREZID
> > 1    NM_130786   A1BG        1
> > 2    NP_570602   A1BG        1
> > 3    NM_000014    A2M        2
> > 4    NP_000005    A2M        2
> > 5 XM_006719056    A2M        2
> > 6 XP_006719119    A2M        2
> >
> >
> >> I need a way to batch query, or preload to memory these lookup tables. I
> >> tried using a hash, but checking if a value is already loaded into the
> >> hash-cache is equally time consuming; and preloading the whole of
> >> org.Hs.eg.db takes a few hours. I could do it once, and cache the .RData
> >> object, but we're back to the local-outdated cache problem.
> >>
> >> So I think the only solution would be to access the sqlite underlying
> the
> >> org.Hs.eg.db myself, so I can use the batch query. Except that db is
> >> hidden
> >> under the R/API of these Anno-BiMap objects like org.Hs.egSYMBOL.
> >>
> >> I assume this problem has been handled before, and ask for your
> guidance.
> >>
> >> Thanks
> >>
> >>         [[alternative HTML version deleted]]
> >>
> >> _______________________________________________
> >> Bioc-devel at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >
> >
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]