[BioC] slow query

Marc Carlson mcarlson at fhcrc.org
Sat Mar 23 01:06:21 CET 2013

Hi Sean,

It's because you just asked for everything associated with that one 
gene, multiplied by everything else. Many of the things that are 
associated with BRCA1 (such as pubmed IDs) have a many-to-one 
relationship with the initial key.  That means that when you add several 
of these kinds of cols to your query (and you asked for ALL of them), 
the number of rows returned gets multiplied out across all of the 
many-to-one relationships.

So for example, suppose you had only asked for pubmed IDs (PMID) and 
ENSEMBL IDs.  And let's also suppose that there are only 4 pubmed IDs 
associated with your gene, and 2 ENSEMBL IDs.  How many rows would that 
be?  Well, in that case the result should be 4 x 2 = 8 rows long.  Now 
what happens if you then also ask for something like UNIPROT (and let's 
assume there are 5 of those)?  Now your result is suddenly 8 x 5 = FORTY 
rows long.  See the problem?  Because the answer is being returned as a 
data.frame, and because there are multiple many-to-one relationships, 
every combination of values gets its own row, and you can end up 
generating a really huge result.  One gene can suddenly end up 
amounting to millions of rows.
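
To make the arithmetic concrete, here is a minimal sketch in base R. It 
does not touch org.Hs.eg.db at all: the three small tables and the IDs 
in them are made up for illustration, and merge() merely stands in for 
the kind of join that produces one flat data.frame.

```r
# Toy tables with made-up IDs (assumptions, not real annotation data).
# Each one maps the same gene symbol to several values (many to one).
pmid    <- data.frame(SYMBOL = "BRCA1", PMID    = paste0("pmid", 1:4))
ensembl <- data.frame(SYMBOL = "BRCA1", ENSEMBL = paste0("ens", 1:2))
uniprot <- data.frame(SYMBOL = "BRCA1", UNIPROT = paste0("uni", 1:5))

# Flattening into one data.frame pairs every value with every other value.
two   <- merge(pmid, ensembl, by = "SYMBOL")   # 4 * 2 rows
three <- merge(two, uniprot, by = "SYMBOL")    # 4 * 2 * 5 rows

print(c(two = nrow(two), three = nrow(three)))

# Kept apart, the same information needs only 4 + 2 + 5 rows.
print(nrow(pmid) + nrow(ensembl) + nrow(uniprot))
```

The row counts multiply when the tables are flattened together, but 
only add up when they are kept separate.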

That is just how the math works out when you store data that has 
complicated relationships in simple flat data.frame objects. Getting 
around the problem of all this wasted row-space is part of why 
relational databases were invented in the first place, and here you are 
calling select, which will attempt to flatten such information for you 
(because it is easier for humans to look at it that way).  But as you 
can see, there are good reasons why we don't actually store it that way 
in the background.

So if you feel like it's taking too long, I would recommend being a 
little more selective about what you ask for.  You can probably get the 
same data with a couple of separate requests, wait a lot less (and also 
end up with much more manageable data.frames).
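
As a rough sketch of what "a couple of separate requests" could look 
like: the helper below, select_each, is a hypothetical name of my own 
(not part of any package), and it simply calls select once per column. 
It assumes AnnotationDbi and org.Hs.eg.db are installed, and it uses the 
`columns` argument name from more recent AnnotationDbi releases; in the 
call quoted below the same argument is spelled `cols`.

```r
# Hypothetical helper: one select() call per requested column, so each
# result only carries a single many-to-one relationship.
select_each <- function(db, keys, keytype, columns) {
  results <- lapply(columns, function(col) {
    AnnotationDbi::select(db, keys = keys, columns = col, keytype = keytype)
  })
  names(results) <- columns
  results  # a named list of small data.frames, one per column
}

# Only runs if the annotation package is available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  db <- org.Hs.eg.db::org.Hs.eg.db
  brca1 <- select_each(db, keys = "BRCA1", keytype = "SYMBOL",
                       columns = c("PMID", "ENSEMBL", "UNIPROT"))
  print(sapply(brca1, nrow))  # rows per request; these add, not multiply
}
```

Each element of the returned list can then be inspected or merged 
selectively, rather than paying for every combination up front.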


On 03/22/2013 01:49 PM, Sean Wang [guest] wrote:
> The query to org.Hs.eg.db is very slow.
> I submit the following query,
> cols=cols(org.Hs.eg.db)
> gns="BRCA1"
> BRCA1.info=select(org.Hs.eg.db, cols=cols, keys=gns, keytype="SYMBOL")
> It takes forever to wait for the result.
> Anyone knows why and please help me.
> Thank you.
>   -- output of sessionInfo():
> (no result yet)
> --
> Sent via the guest posting facility at bioconductor.org.
