[BioC] slow graph methods [was: 'slow insertions in to graphNEL object (24 hours for 16k nodes)']
Wolfgang Huber
huber at ebi.ac.uk
Wed Sep 26 23:36:49 CEST 2007
Dear Paul,
there is almost surely scope for performance improvements in any of
these functions (and patches are welcome), but it is also worth noting
that R is a functional language with mostly "pass-by-value" semantics,
which will often place limits on this. A syntax like
nodeData(g, "node1", "flavour") = "strawberry"
really evaluates to
g = "nodeData<-"(g, "node1", "flavour", value = "strawberry")
and a full copy of g is made and eventually destroyed in memory each
time you call this. The same applies to addNode etc.
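To see this copy-and-rebind semantics concretely, here is a toy
replacement function ("flavour<-" is a made-up setter for illustration,
not part of the graph package) that mimics what "nodeData<-" does:

```r
## Hypothetical setter: every replacement function in R receives the
## whole object, modifies a local copy, and returns it.
"flavour<-" <- function(g, node, value) {
  g$attrs[[node]] <- value   # changes a copy local to this function
  g                          # the copy is returned to the caller
}

g <- list(attrs = list())
flavour(g, "node1") <- "strawberry"
## The line above is sugar for:
##   g <- `flavour<-`(g, "node1", value = "strawberry")
## i.e. the whole of g passes through the function and is rebound.
g$attrs$node1   # "strawberry"
```

For a small list this is cheap, but for a graph with thousands of nodes
the whole object travels through every such call.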
Until someone writes more performant graph classes (note that
"graphNEL", which you are probably using, is only one of several
subclasses of "graph"), there are two pragmatic workarounds:
1. If you know the node set in advance, your edges do not have complex
attributes, and you just want to incrementally add and remove edges, you
could use the base-R matrix class to store the graph's adjacency matrix,
and only at the end convert it to a "graph" object (see the man page
"?ftM2graphNEL"). I believe native data types in R such as numeric
vectors and matrices have much more efficient replacement methods
("[<-") than what I described above for "nodeData<-" - they are able to
change an object in situ, without copying.
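A minimal sketch of workaround 1 (the node names and edges here are
invented for illustration):

```r
## Store the graph as a plain base-R adjacency matrix.
nodes <- paste0("node", 1:4)
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))

## "[<-" on a matrix can modify in place - no full-object copy per call.
adj["node1", "node2"] <- 1
adj["node2", "node3"] <- 1
adj["node1", "node2"] <- 0   # removing an edge is just as cheap

## Once done, convert in a single step (requires the graph package;
## see ?ftM2graphNEL for the expected from-to matrix format):
## ft <- which(adj == 1, arr.ind = TRUE)
## g  <- graph::ftM2graphNEL(cbind(rownames(adj)[ft[, 1]],
##                                 colnames(adj)[ft[, 2]]))
```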
2. Alternatively, you could use an environment to store objects
describing your nodes and edges. Adding and removing things from an
environment is very cheap. Again, once you are done, you could convert
to the cleaner and more robust S4 graph object.
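A minimal sketch of workaround 2, again with invented node names and
attributes:

```r
## Environments are the one base-R container with reference semantics:
## modifying one does not copy it.
store <- new.env(hash = TRUE)

assign("node1", list(flavour = "strawberry"), envir = store)
store$node2 <- list(flavour = "vanilla")   # modified in place
rm("node2", envir = store)                 # removal is cheap, too

get("node1", envir = store)$flavour        # "strawberry"

## Once done, loop over ls(store) once to build the S4 graph object.
```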
Best wishes
Wolfgang
------------------------------------------------------------------
Wolfgang Huber EBI/EMBL Cambridge UK http://www.ebi.ac.uk/huber
Shannon wrote:
> A couple of weeks ago, just before he left, Seth made a striking (800x)
> improvement to the speed of graph::fromGXL. This has been a big help.
>
> However, I'd like to construct graphs incrementally, and
> programmatically,
> without having to create new gxl files. Is it possible that
> Seth's efficiencies have not yet been applied to other methods in the
> graph class?
>
> In particular, it seems that the following operations take a longish
> time
> on graphs of a few thousand nodes:
>
> edgeData
> nodeData
> addEdge
> addNode
>
> I could provide actual times and test cases if that would help.
>
> Thanks!
>
> - Paul
>
>
>
> On Sep 12, 2007, at 8:09 AM, Seth Falcon wrote:
>
>> Hi again,
>>
>> Seth Falcon <sfalcon at FHCRC.ORG> writes:
>>> Paul Shannon <pshannon at systemsbiology.org> writes:
>>>> It took nearly 24 hours (!) to create a 16k node graph using two
>>>> different techniques:
>>>>
>>>> g = fromGXL (file ('someFile.gxl'))
>> Using a patched version of fromGXL (on my laptop) I get:
>>
>>> library(graph)
>>> con = file("kegg.yeast.gxl", open="r")
>>> system.time(z <- fromGXL(con))
>> user system elapsed
>> 104.366 0.570 105.070
>>> z
>> A graphNEL graph with undirected edges
>> Number of Nodes = 15158
>> Number of Edges = 32668
>>> validObject(z)
>> [1] TRUE
>>
>> That's over 800x faster :-)
>>
>> I've checked in the changes in devel as graph 1.15.17. You can get it
>> from svn or wait till this time tomorrow (in either case, you will
>> need R 2.6.0 alpha).
>>
>> The code passes the existing unit tests, but some extra scrutiny that
>> the returned graphs are as desired is quite welcome.
>>
>> + seth
>>
>> --
>> Seth Falcon | Computational Biology | Fred Hutchinson Cancer
>> Research Center
>> BioC: http://bioconductor.org/
>> Blog: http://userprimary.net/user/
>