[BioC] slow graph methods [was: 'slow insertions in to graphNEL object (24 hours for 16k nodes)']

Wolfgang Huber huber at ebi.ac.uk
Wed Sep 26 23:36:49 CEST 2007


Dear Paul,

there is almost surely scope for performance improvements in any of 
these functions (and patches are welcome), but it is also worth noting 
that R is a functional language with mostly "pass-by-value" semantics, 
which will often put limits on this. A syntax like

  nodeData(g, "node1", "flavour") = "strawberry"

really evaluates to

  g = "nodeData<-"(g, "node1", "flavour", "strawberry")

and a full copy of g is made and eventually destroyed in memory each 
time you call this. Similarly for addNode etc.
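
For illustration, here is a minimal sketch with a made-up replacement 
function (it is not part of the graph package) showing this rewriting: 
the whole object is handed to the replacement function and the modified 
copy is assigned back:

  ## hypothetical replacement function, for illustration only
  "flavour<-" = function(x, node, value) {
      x$attrs[[node]] = value   # x is a copy of the caller's object
      x
  }

  g = list(attrs = list())
  flavour(g, "node1") = "strawberry"
  ## R rewrites the line above as
  g = "flavour<-"(g, node = "node1", value = "strawberry")
  g$attrs$node1   # [1] "strawberry"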

Until someone writes more performant graph classes (note that 
"graphNEL", which you are probably using, is only one of several 
subclasses of "graph"), here are two pragmatic workarounds:

1. If you know the node set in advance, your edges do not have complex 
attributes, and you just want to add and remove edges incrementally, you 
could use a base-R matrix to store the graph's adjacency matrix, and 
only convert it to a "graph" object at the end (see the man page "? 
ftM2graphNEL"). I believe the native data types in R such as numeric 
vectors and matrices have much more efficient replacement methods 
("[<-") than what I said above about "nodeData<-" - they are able to 
change an object in place. (A sketch follows after this list.)

2. Alternatively, you could use an environment to store the objects 
describing your nodes and edges. Adding and removing things from an 
environment is very cheap. Again, once you are done, you could convert 
to the cleaner and more robust S4 graph object. (See the second sketch 
below.)
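
A minimal sketch of option 1 (the node names and edges here are made 
up):

  library(graph)

  ## node set known in advance; edges have no complex attributes
  nodes = paste("node", 1:4, sep = "")
  adj = matrix(0, length(nodes), length(nodes),
               dimnames = list(nodes, nodes))

  ## "[<-" on a plain matrix is cheap, so edges can be added and
  ## removed many times without touching an S4 object
  adj["node1", "node2"] = 1
  adj["node2", "node3"] = 1
  adj["node1", "node4"] = 1

  ## convert only once, at the end: adjacency matrix -> from-to
  ## matrix -> graphNEL
  ft = which(adj == 1, arr.ind = TRUE)
  g = ftM2graphNEL(cbind(nodes[ft[, 1]], nodes[ft[, 2]]),
                   edgemode = "directed")
  g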
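
And a sketch of option 2, accumulating edges in an environment before 
converting (again with made-up names; addEdgeFast is not part of the 
graph package):

  library(graph)

  edgeEnv = new.env(hash = TRUE)

  ## assigning into an environment does not copy the accumulated data
  addEdgeFast = function(from, to, env) {
      prev = if (exists(from, envir = env, inherits = FALSE))
          get(from, envir = env) else character(0)
      assign(from, c(prev, to), envir = env)
  }

  addEdgeFast("node1", "node2", edgeEnv)
  addEdgeFast("node1", "node3", edgeEnv)
  addEdgeFast("node2", "node3", edgeEnv)

  ## convert once at the end into a graphNEL
  nodes = sort(unique(c(ls(edgeEnv), unlist(as.list(edgeEnv)))))
  edgeL = lapply(nodes, function(n)
      if (exists(n, envir = edgeEnv, inherits = FALSE))
          list(edges = match(get(n, envir = edgeEnv), nodes))
      else list(edges = integer(0)))
  names(edgeL) = nodes
  g = new("graphNEL", nodes = nodes, edgeL = edgeL,
          edgemode = "directed")
  g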

    Best wishes
	Wolfgang

------------------------------------------------------------------
Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber

Shannon wrote:
> A couple of weeks ago, just before he left, Seth made a striking (800x)
> improvement to the speed of graph::fromGXL.  This has been a big help.
> 
> However, I'd like to construct graphs incrementally, and  
> programmatically,
> without having to create new gxl files.  Is it possible that
> Seth's efficiencies have not yet been applied to other methods in the  
> graph class?
> 
> In particular, it seems that the following operations take a longish  
> time
> on graphs of a few thousand nodes:
> 
>    edgeData
>    nodeData
>    addEdge
>    addNode
> 
> I could provide actual times and test cases if that would help.
> 
> Thanks!
> 
>   - Paul
> 
> 
> 
> On Sep 12, 2007, at 8:09 AM, Seth Falcon wrote:
> 
>> Hi again,
>>
>> Seth Falcon <sfalcon at FHCRC.ORG> writes:
>>> Paul Shannon <pshannon at systemsbiology.org> writes:
>>>> It took nearly 24 hours (!) to create a 16k node graph using two
>>>> different techniques:
>>>>
>>>>     g = fromGXL (file ('someFile.gxl'))
>> Using a patched version of fromGXL (on my laptop) I get:
>>
>>> library(graph)
>>> con = file("kegg.yeast.gxl", open="r")
>>> system.time(z <- fromGXL(con))
>>        user  system elapsed
>>     104.366   0.570 105.070
>>> z
>>     A graphNEL graph with undirected edges
>>     Number of Nodes = 15158
>>     Number of Edges = 32668
>>> validObject(z)
>>     [1] TRUE
>>
>> That's over 800x faster :-)
>>
>> I've checked in the changes in devel as graph 1.15.17.  You can get it
>> from svn or wait till this time tomorrow (in either case, you will
>> need R 2.6.0 alpha).
>>
>> The code passes the existing unit tests, but extra scrutiny of whether
>> the returned graphs are as desired is quite welcome.
>>
>> + seth
>>
>> -- 
>> Seth Falcon | Computational Biology | Fred Hutchinson Cancer  
>> Research Center
>> BioC: http://bioconductor.org/
>> Blog: http://userprimary.net/user/
>


