[R-sig-hpc] distributed performance monitoring
ross at biostat.ucsf.edu
Fri Jan 10 03:32:17 CET 2014
I'm doing a distributed MIMD computation on a cluster, and would like to
be able to tell where it is bottlenecking. For example, there are many
simulator processes, but only a few coefficient server processes. If
the coefficient servers are saturated I can add more, but I need to know
if they are saturated.
This is using Rmpi on a Debian cluster.
My first thought was that each process could record logs of the time it
entered various states, and I could then look at these to see if they
were idle. E.g., I could log the message tag and source or destination
for each message send, along with a timestamp. Two concerns are what an
efficient data structure for a log is (my understanding is that rbind
can be both slow and memory intensive) and whether getting the timestamp
could itself be an expensive operation.
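A minimal sketch of the kind of per-process log described above, assuming preallocated vectors held in an environment and doubled when full, so that nothing is rebuilt with rbind on each event. The names make_event_log, log_event, log_as_data_frame and time_in_state are hypothetical helpers, not part of Rmpi or any other package, and the state labels are made up. Timestamps come from Sys.time(), which is only comparable across nodes if their clocks are synchronised.

```r
# Hypothetical helpers for illustration; not from Rmpi or any package.

# Create a log with preallocated columns stored in an environment,
# so appends modify the log by reference instead of copying a data frame.
make_event_log <- function(capacity = 1024L) {
  log <- new.env()
  log$n <- 0L
  log$time <- numeric(capacity)
  log$state <- character(capacity)
  log$tag <- integer(capacity)
  log$peer <- integer(capacity)
  log
}

# Record one event: a state label plus optional message tag and peer rank.
log_event <- function(log, state, tag = NA_integer_, peer = NA_integer_) {
  n <- log$n + 1L
  if (n > length(log$time)) {
    # Out of room: double the capacity of every column.
    grow <- length(log$time)
    log$time <- c(log$time, numeric(grow))
    log$state <- c(log$state, character(grow))
    log$tag <- c(log$tag, integer(grow))
    log$peer <- c(log$peer, integer(grow))
  }
  log$time[n] <- as.numeric(Sys.time())  # wall-clock seconds since the epoch
  log$state[n] <- state
  log$tag[n] <- as.integer(tag)
  log$peer[n] <- as.integer(peer)
  log$n <- n
  invisible(log)
}

# Convert the filled part of the log to a data frame, once, at the end.
log_as_data_frame <- function(log) {
  i <- seq_len(log$n)
  data.frame(time = log$time[i], state = log$state[i],
             tag = log$tag[i], peer = log$peer[i],
             stringsAsFactors = FALSE)
}

# Total seconds spent in each state, taking the time between one event
# and the next as the time spent in the earlier event's state.
time_in_state <- function(events) {
  dwell <- diff(events$time)
  tapply(dwell, events$state[-nrow(events)], sum)
}

# Example: a made-up server loop alternating between two states.
log <- make_event_log(capacity = 4L)
for (i in 1:10) {
  log_event(log, "waiting")
  log_event(log, "serving", tag = 1L, peer = i)
}
events <- log_as_data_frame(log)
by_state <- time_in_state(events)

# Rough per-call cost of taking a timestamp, in seconds.
n_calls <- 10000
per_call <- system.time(for (i in seq_len(n_calls)) Sys.time())[["elapsed"]] / n_calls
```

Whether the timestamp is expensive can be judged by comparing per_call with the typical time between messages. A coefficient server whose "waiting" total in by_state is close to zero would be a candidate for saturation.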
My second thought was that this is probably already a solved problem,
but I'm not sure where to look. Some of R's parallel libraries (the
example I saw might have been snow) can produce basic graphs showing
when processes were active. And there are probably MPI level facilities
for this sort of thing too (we're using Open MPI).
Any pointers or suggestions?