[R-sig-hpc] Excessive network traffic and slow code execution when running R via qsub on an HPC
Rainer M Krug
Rainer at krugs.de
Thu Jan 16 10:15:54 CET 2014
On 01/15/14, 23:13, Chris Davis wrote:
> I'm trying to debug a problem where jobs running R code that I
> submit via qsub are running very slowly and causing significant
> traffic between the head node and the nodes that the jobs are
> running on. The more instances of R I run, the more they
> collectively slow down.
> Much of the documentation I've found on running R on an HPC
> recommends using "module load R". Trying this led to the same
> problem, since the R install referred to by the module is on the head
> node. When I ran "lsof -u MyUserName" on the head node, I could
> see that there were open files related to the R executable and the
> libraries. These files were open multiple times, based on the
> number of running instances of R.
> I then installed a new version of R in the /tmp directory of the
> head node, zipped it all up, and then tried running jobs that would
> copy this all to the /tmp directory of the nodes, unzip it, and run
> a version of R locally. The same symptoms occurred. This time,
> running lsof on the head node didn't show any files directly
> related to the R install, but I do see a reference to my
> $PBS_O_WORKDIR, and there's a mention of several *.so libraries
> located in lib directories like /.rootfs/el6.4-1/lib64/. As I
> understand it, these libraries are on the head node. The output of
> lsof also shows that the R code isn't spending its time
> reading/writing to user-defined files on the head node.
> The bash scripts I'm using to submit the jobs via qsub are
> slightly modified from ones that I used in the past to run java
> programs. Instead of "java -jar...", I'm now doing
> "/tmp/R-3.0.2/bin/Rscript --vanilla MyCode.R $JOB_NUMBER". The R
> code loads the necessary libraries on startup, and only reads and
> writes from the local node. Processing time when things are working
> ok (tested on a single computer) is around several hours, and the
> code runs on a single CPU, so no parallel processing is involved.
> In terms of packages, I'm only using sqldf aside from the default
> R packages, and as I understand from the sqldf documentation, the
> database is stored in memory, so there should be no network traffic
> associated with queries that are running.
> Does anyone have ideas of how I can further debug this? I'm not
> sure what else to check for and haven't been able to find a mention
> on mailing lists of people running into the same problems with
> network traffic slowing down running instances of R.
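To put numbers on the lsof observation above (the same R files open once per running instance), one could count how often each path appears in lsof's machine-readable output. A minimal sketch, assuming lsof's `-Fn` field output, where file-name lines are prefixed with `n`; the helper name `summarize_open_files` is hypothetical. Note that this only counts open file handles; it does not measure network traffic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lsof -Fn` output on stdin and print how many
# times each file path is open, most frequent first.
# In -F output, lines starting with "p" are process IDs, "f" are file
# descriptors and "n" are file names; only the "n" lines are kept here.
summarize_open_files() {
  sed -n 's/^n//p' | sort | uniq -c | sort -rn
}

# Example, run on the head node while jobs are active:
#   lsof -u "$USER" -Fn | summarize_open_files | head -20
```

If a shared library shows up N times while N instances of R are running, that matches the pattern described above.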
I ran a substantial number of simulations in R via qsub and did not
experience slowdowns caused by network traffic, although I tried to
minimise it by doing exactly what you were trying to do: copying (in
my case) data to the nodes. I copied the data to $TMPDIR, which is
*local* on the node and not linked.
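Copying data to node-local scratch as described above might look like the following PBS job-script sketch. Assumptions not taken from the thread: the scheduler sets `$TMPDIR` to a node-local directory (common on PBS/Torque setups, but not guaranteed), `Rscript` is on the PATH, and the file `input.csv` and the output directory `results/` are hypothetical names.

```shell
#!/usr/bin/env bash
#PBS -l nodes=1:ppn=1
# Sketch: stage inputs to node-local scratch, run R there, copy results
# back once. input.csv and results/ are hypothetical names.
set -euo pipefail

# Copy the listed files from directory $1 into directory $2 (created if needed).
stage_in() {
  local src=$1 dest=$2
  shift 2
  mkdir -p "$dest"
  local f
  for f in "$@"; do
    cp -- "$src/$f" "$dest/"
  done
}

run_job() {
  # Assumption: $TMPDIR is node-local; fall back to /tmp if it is unset.
  local scratch="${TMPDIR:-/tmp}/job_${PBS_JOBID:-$$}"
  stage_in "$PBS_O_WORKDIR" "$scratch" MyCode.R input.csv
  cd "$scratch"
  # All reads and writes during the long computation now hit the local disk.
  Rscript --vanilla MyCode.R "${JOB_NUMBER:-0}"
  # A single bulk copy back to the shared filesystem at the end.
  cp -r -- results "$PBS_O_WORKDIR/results_${PBS_JOBID:-$$}"
}

# Only run the job body when executed inside a PBS job.
if [[ -n "${PBS_JOBID:-}" ]]; then
  run_job
fi
```

It could be submitted with, e.g., `qsub -v JOB_NUMBER=3 job.sh`. Copying in once and out once keeps traffic to the shared filesystem to two bulk transfers per job.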
I assume that the directory you are copying to is actually only
linked to the nodes (check with the HPC admin which folders are
linked and which are local).
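Besides asking the admins, a quick way to see which directories are local is to check their filesystem type from inside a job on a compute node. A minimal heuristic sketch, assuming GNU coreutils `stat`, whose `-f` (file-system) mode with the `%T` format prints a human-readable type such as `nfs`; the helper name `is_network_fs` and the list of network types are assumptions, and the list is not exhaustive.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic: return 0 if directory $1 is on a common
# network/shared filesystem, 1 if not, 2 if stat fails.
# Type names are those printed by GNU `stat -f -c %T`.
is_network_fs() {
  local fstype
  fstype=$(stat -f -c %T -- "$1") || return 2
  case "$fstype" in
    nfs|cifs|smb|smb2|lustre|gpfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Example, run inside a job on a compute node:
#   for d in "$TMPDIR" /tmp "$PBS_O_WORKDIR" "$HOME"; do
#     if is_network_fs "$d"; then echo "$d: shared"; else echo "$d: local"; fi
#   done
```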
On the other hand, my R installation was always in my home directory,
i.e. not on the nodes but only linked. So I really don't think that
the network traffic is caused by the R installation, but by something else.
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)
Centre of Excellence for Invasion Biology
Tel : +33 - (0)9 53 10 27 44
Cell: +33 - (0)6 85 62 59 98
Fax : +33 - (0)9 58 10 27 44
Fax (D): +49 - (0)3 21 21 25 22 44
email: Rainer at krugs.de