[R-sig-hpc] distributed performance monitoring

Ross Boylan ross at biostat.ucsf.edu
Fri Jan 10 03:32:17 CET 2014


I'm doing a distributed MIMD computation on a cluster, and would like to 
be able to tell where it is bottlenecking.  For example, there are many 
simulator processes, but only a few coefficient server processes.  If 
the coefficient servers are saturated I can add more, but I need to know 
if they are saturated.

This is using Rmpi on a Debian cluster.

My first thought was that each process could log the time at which it 
entered various states, and I could then look at these logs to see 
whether the processes were idle.  E.g., I could log the message tag and 
source or destination for each message sent, along with a timestamp.  
Two concerns are what an efficient data structure for a log is (my 
understanding is that rbind can be both slow and memory intensive) and 
whether getting the timestamp could itself be an expensive operation.
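
One way to sketch such a log, assuming preallocated vectors that are 
doubled when full rather than a data frame grown with rbind.  The names 
make_event_log, record and as_data_frame are hypothetical and not part 
of Rmpi; Sys.time() is only one possible timestamp source.  The loop at 
the end gives a rough estimate of the per-entry cost, timestamp 
included, on whatever machine it is run on.

```r
# Hypothetical helper: an append-only event log kept in preallocated vectors.
make_event_log <- function(capacity = 100000L) {
  times <- numeric(capacity)   # timestamps, seconds since the epoch
  tags  <- integer(capacity)   # message tags
  peers <- integer(capacity)   # source or destination ranks
  n     <- 0L                  # number of entries used so far

  record <- function(msg_tag, msg_peer) {
    if (n == length(times)) {
      # Log is full: double the capacity in one step instead of
      # growing by one row per entry.
      extra <- length(times)
      times <<- c(times, numeric(extra))
      tags  <<- c(tags,  integer(extra))
      peers <<- c(peers, integer(extra))
    }
    n <<- n + 1L
    times[n] <<- as.numeric(Sys.time())
    tags[n]  <<- as.integer(msg_tag)
    peers[n] <<- as.integer(msg_peer)
    invisible(NULL)
  }

  # Build a data frame once, at the end, from the used part of the log.
  as_data_frame <- function() {
    idx <- seq_len(n)
    data.frame(time = times[idx], tag = tags[idx], peer = peers[idx])
  }

  list(record = record, as_data_frame = as_data_frame)
}

# Rough overhead estimate: time many log entries and divide.
evlog <- make_event_log(capacity = 1024L)
n_events <- 10000L
elapsed <- system.time(
  for (i in seq_len(n_events)) evlog$record(i %% 7L, i %% 3L)
)[["elapsed"]]
cat("approx. seconds per logged event:", elapsed / n_events, "\n")
```

Each process would keep its own log and write it out (or send it to 
the master) at the end of the run, so that logging itself adds no 
message traffic during the computation.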

My second thought was that this is probably already a solved problem, 
but I'm not sure where to look.  Some of R's parallel libraries (the 
example I saw might have been snow) can produce basic graphs showing 
when processes were active.  And there are probably MPI-level facilities 
for this sort of thing too (we're using Open MPI).

Any pointers or suggestions?

Thanks.
Ross Boylan