[Rd] Problems when printing *large* R objects

Martin Maechler maechler at stat.math.ethz.ch
Mon Dec 6 11:44:01 CET 2004


>>>>> "Simon" == Simon Urbanek <simon.urbanek at math.uni-augsburg.de>
>>>>>     on Sun, 5 Dec 2004 19:39:07 -0500 writes:

    Simon> On Dec 4, 2004, at 9:50 PM, ap_llywelyn at mac.com
    Simon> wrote:
    >> Source code leading to crash:
    >> 
    >> library(cluster)
    >> data(xclara)
    >> plot(hclust(dist(xclara)))
    >> 
    >> This leads to a long wait where the application is frozen
    >> (spinning status bar going the entire time), a quartz
    >> window displays without any content, and then the
    >> following application crash occurs:

    Simon> Please post this to the maintainers of the cluster
    Simon> library (if at all),

Well, this is a *package*, not a library {please, please!}

And really, that has nothing to do with the 'cluster' package
(whose maintainer I am), as David only uses its data set.
hclust() and dist() are in the standard 'stats' package.

Btw, this can be accomplished more cleanly, i.e., without
attaching "cluster", by 

  data(xclara, package = "cluster")


    Simon> this has nothing to do
    Simon> with the GUI (behaves the same in X11).  The above
    Simon> doesn't make much sense anyway - you definitely want
    Simon> to use cutree before you plot that dendrogram ...

Indeed!  
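
To make Simon's suggestion a bit more concrete, here is a minimal sketch.
The data are a made-up stand-in (300 simulated points, not xclara), and
the choice of 3 groups and of the cut height are assumptions for
illustration only.  cutree() gives the group memberships; for the plot
itself, cut() on the dendrogram keeps only the top of the tree.

```r
## Hypothetical stand-in for a large data set: 300 points in 3 groups.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean =  0), ncol = 2),
           matrix(rnorm(200, mean =  5), ncol = 2),
           matrix(rnorm(200, mean = 10), ncol = 2))

hc  <- hclust(dist(x))        # default method: complete linkage
grp <- cutree(hc, k = 3)      # cluster membership, one integer per row
table(grp)

## Plot only the upper part of the tree instead of all 300 leaves:
## cut halfway between the 2nd and 3rd largest merge heights.
h3  <- mean(rev(hc$height)[2:3])
top <- cut(as.dendrogram(hc), h = h3)$upper
plot(top)
```

With 3000 observations the full dendrogram has 3000 leaves, so a plot of
the upper branches is usually about all one can read anyway.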

A bit more explicitly for David:
xclara has 3000 observations, 
i.e. 3000*2999/2 ~= 4.5 Mio distances {i.e., about 36
MBytes to keep in memory and about 48 mio characters to display
when you use default options(digits=7)}.
I don't think you can really make much of printing this many
numbers onto your console as you try with

    David> dist(xclara) -> xclara.dist

    David> Works okay, though when attempting to display those results it freezes  
    David> up the entire system, probably as the result of memory
    David> thrashing/starvation if the top results are any indicator:

    David> 1661 R   8.5%  9:36.12   3    92   567   368M+ 3.88M   350M-  828M
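
The size arithmetic for the distance object can be checked directly, and
such an object can be looked at without printing all of it.  The sketch
below uses a smaller made-up example (1000 random points) rather than
xclara, so that it runs quickly; the inspection calls are just one
reasonable choice.

```r
n      <- 3000                  # observations in xclara
n_dist <- n * (n - 1) / 2       # pairwise distances stored by dist()
n_dist                          # 4498500
n_dist * 8 / 1e6                # 35.988, i.e. about 36 MB as doubles

## Inspect a 'dist' object without printing every entry
## (small hypothetical example: 1000 random points in 2 dimensions)
d <- dist(matrix(rnorm(2000), ncol = 2))
length(d)                       # 1000 * 999 / 2 = 499500
str(d)                          # compact description of the structure
summary(as.vector(d))           # quartiles and mean of the distances
print(object.size(d), units = "MB")
```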

"freezes up the entire system"  when trying to print something
too large actually does have something to do with the user interface.
AFAIK, it doesn't work 'nicely' on any R console,
but at least in ESS on Linux, it's just that one Emacs,
displaying the "wrist watch" (and I can easily tell Emacs "not
to wait" by pressing Ctrl-g).  Also, just trying it now {on a
machine with large amounts of RAM}: After pressing return, it at
least starts printing (displaying to the *R* buffer) after a bit
more than 1 minute.. and that does ``seem'' to never finish.
I can signal a break (via the [Signals] menu or C-c C-c in
Emacs), and still have to wait about 2-3 minutes until the output
stops; but it does, and I can work on.. {well, in theory; my Emacs
seems to have become v..e..r..y  s...l...o....w}.  We only
recently had a variation on this theme on the ESS-help mailing
list, and several people were reporting they couldn't really
stop R from printing and had to kill the R process...

So after all, there's a not quite trivial problem "hidden" in
David's report:  What should happen if the user accidentally
tries to print a huge object to the console... and how to make
sure R can be told to stop.
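
On the user side, one can guard against the accident with a small
wrapper.  This is only a sketch: safe_print() and its threshold max_len
are hypothetical names, not part of R, and it does nothing about the
interrupt problem itself.

```r
## Hypothetical helper (not part of R): print small objects as usual,
## but show only a compact description of objects above a threshold.
## (getOption("max.print") is a related built-in limit in current R.)
safe_print <- function(x, max_len = 10000L) {
  if (length(x) <= max_len) {
    print(x)
  } else {
    cat("Object of class '", class(x)[1], "' with ", length(x),
        " elements; showing str() instead of printing:\n", sep = "")
    str(x)
  }
  invisible(x)                  # return the object unchanged
}

d <- dist(matrix(rnorm(2000), ncol = 2))   # 499500 distances
safe_print(d)                   # short description only
safe_print(1:5)                 # small object: printed normally
```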

And as I see it now, there's even something like an R "bug" (or
"design infelicity") here:

I've now done it again {on a compute server, a Dual-Opteron with 4
GB RAM}:  After stopping, via the ESS [Signals] [Break (C-c C-c)] menu,
Emacs stops immediately, but R doesn't return quickly,
and rather, watching "top" {the good ol' unix process monitor} I
see R using 99.9% CPU and its memory footprint ("VIRT" and
"SHR") increasing and increasing..., up to '1081m', a bit more
than 1 GB, when R finally returns (displays the prompt) after
only a few minutes --- but then, as said, this is on a remote
64bit machine with 4000 MB RAM.

BTW, when I then remove the 'dist' (and hclust) objects in R,
and type  gc()  (or maybe do some other things in R), the R
process has about halved its apparent memory usage to
500something MB.

more stats:
     during printing:  798m
     after "break"  :  798m for ~5 seconds, then starting to
                       grow slowly (in my top, in steps of ~10m)
                       up to 1076m;
     then the R prompt is displayed and top shows "1081m".

It stays there until I do
   > gc()
where it goes down to VIRT 841m (RES 823m),
and after removing the large distance object, and gc() again,
it lowers to 820m (RES 790m) and stays there.
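
The rm() / gc() sequence can be followed from within R as well.  A
minimal sketch, with a made-up 5e6-element vector standing in for the
distance object; note that gc() reports R's own heap accounting, which
need not match the VIRT/RES figures that top shows for the process.

```r
g0 <- gc()                      # baseline after a full collection
x  <- numeric(5e6)              # about 40 MB of doubles (hypothetical size)
g1 <- gc()                      # x is still referenced, so it is kept
rm(x)                           # drop the only reference
g2 <- gc()                      # now the vector can be reclaimed

## Vcells in use, in Mb (column 2): rises with x, falls after rm() + gc()
c(before = g0["Vcells", 2], with_x = g1["Vcells", 2], after = g2["Vcells", 2])
```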

Probably this thread should be moved to R-devel -- and hence I
crosspost for once.

Martin


