[BioC] slow query
Marc Carlson
mcarlson at fhcrc.org
Sat Mar 23 01:06:21 CET 2013
Hi Sean,
It's because you asked for everything associated with that one gene,
and each kind of annotation gets multiplied by every other kind. Many
of the things associated with BRCA1 (such as PubMed IDs) have a
many-to-one relationship with the initial key. That means that when you
add several of these kinds of cols to your query (and you asked for ALL
of them), the number of rows returned is multiplied out across all of
the many-to-one relationships.
So for example, suppose you had asked only for PubMed IDs (PMID) and
ENSEMBL IDs. And let's also suppose that there are only 4 PubMed IDs
associated with your gene, and 2 ENSEMBL IDs. How many rows would that
be? Well, in that case the result should be 8 rows long (4 x 2). Now
what happens if you then also ask for something like UNIPROT (and let's
assume there are 5 of those)? Now your result is suddenly FORTY rows
long (4 x 2 x 5). See the problem? Because the answer is being returned
as a data.frame, and because there are multiple many-to-one
relationships, you can end up generating a really huge result when the
data are represented as a simple data.frame. One gene can suddenly end
up amounting to millions of rows.
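To make the arithmetic concrete, here is a small base R sketch that
mimics the flattening with made-up data. The ID values are placeholders
(not real annotations), and merge() only stands in for what select does
when it combines the columns; it is not how select is implemented.

```r
# Hypothetical counts from the example: 4 PMIDs, 2 ENSEMBL IDs and
# 5 UNIPROT IDs for one gene. The ID strings are invented placeholders.
pmid    <- data.frame(SYMBOL = "BRCA1", PMID    = paste0("pmid", 1:4))
ensembl <- data.frame(SYMBOL = "BRCA1", ENSEMBL = paste0("ens", 1:2))
uniprot <- data.frame(SYMBOL = "BRCA1", UNIPROT = paste0("uni", 1:5))

# Flattening into one data.frame pairs every value in one column with
# every value in the others, because they all share the same key.
two_cols   <- merge(pmid, ensembl, by = "SYMBOL")
three_cols <- merge(two_cols, uniprot, by = "SYMBOL")

nrow(two_cols)    # 4 * 2
nrow(three_cols)  # 4 * 2 * 5
```

Each additional many-to-one column multiplies the row count again, which
is why asking for every column at once grows so quickly.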
That is just how the math works out when you store data that has
complicated relationships in simple flat data.frame objects. Getting
around the problem of all this wasted row-space is part of why
relational databases were invented in the first place, and here you are
calling select which will attempt to flatten such information for you
(because it is easier for humans to look at it that way). But as you
can see, there are good reasons why we don't actually store it that way
in the background.
So if you feel like it's taking too long, I would recommend being a
little more selective about what you ask for. You can probably get the
same data with a couple of separate requests, wait a lot less (and also
end up with much more manageable data.frames).
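As a rough sketch of what separate requests could look like: the three
columns below are just the ones from my example, not a recommendation,
and I am assuming a release where the select argument is named
`columns` (the code in the original post uses the older name `cols`).

```r
# Columns to fetch one at a time; chosen only to match the example above.
wanted <- c("PMID", "ENSEMBL", "UNIPROT")

# Assumes the org.Hs.eg.db annotation package is installed; the guard
# simply skips the queries when it is not available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  library(org.Hs.eg.db)

  # One small query per column, so each result only has as many rows
  # as that column has values for the gene.
  res <- lapply(wanted, function(col) {
    select(org.Hs.eg.db, keys = "BRCA1", columns = col, keytype = "SYMBOL")
  })
  names(res) <- wanted

  # Row count of each separate result.
  sapply(res, nrow)
}
```

The total number of rows is then the sum of the per-column counts
rather than their product, and you can join only the pieces you
actually need afterwards.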
Marc
On 03/22/2013 01:49 PM, Sean Wang [guest] wrote:
> The query to org.Hs.eg.db is very slow.
>
> I submit the following query,
>
> cols=cols(org.Hs.eg.db)
> gns="BRCA1"
> BRCA1.info=select(org.Hs.eg.db, cols=cols, keys=gns, keytype="SYMBOL")
>
>
> It takes forever to wait for the result.
>
> Anyone knows why and please help me.
>
> Thank you.
>
> -- output of sessionInfo():
>
> (no result yet)
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor