[Rd] Problems when printing *large* R objects
Martin Maechler
maechler at stat.math.ethz.ch
Mon Dec 6 11:44:01 CET 2004
>>>>> "Simon" == Simon Urbanek <simon.urbanek at math.uni-augsburg.de>
>>>>> on Sun, 5 Dec 2004 19:39:07 -0500 writes:
Simon> On Dec 4, 2004, at 9:50 PM, ap_llywelyn at mac.com
Simon> wrote:
>> Source code leading to crash:
>>
>> library(cluster)
>> data(xclara)
>> plot(hclust(dist(xclara)))
>>
>> This leads to a long wait where the application is frozen
>> (spinning status bar going the entire time), a quartz
>> window displays without any content, and then the
>> following application crash occurs:
Simon> Please post this to the maintainers of the cluster
Simon> library (if at all),
Well, this is a *package*, not a library {please, please!}.
And really, that has nothing to do with the 'cluster' package
(whose maintainer I am), as David only uses its data set.
hclust() and dist() are in the standard 'stats' package.
Btw, this can be accomplished more cleanly, i.e., without
attaching "cluster", by
data(xclara, package = "cluster")
Simon> this has nothing to do
Simon> with the GUI (behaves the same in X11). The above
Simon> doesn't make much sense anyway - you definitely want
Simon> to use cutree before you plot that dendrogram ...
Indeed!
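Roughly what Simon means, as a sketch {with xclara loaded as above;
the number of clusters, 3, is just my pick for illustration}:

  hc  <- hclust(dist(xclara))
  grp <- cutree(hc, k = 3)    # cut the tree into 3 groups
  table(grp)                  # group sizes
  plot(xclara, col = grp)     # scatter plot coloured by group,
                              # instead of a 3000-leaf dendrogram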
A bit more explicitly for David:
xclara has 3000 observations,
i.e. 3000*2999/2 ~= 4.5 million distances {i.e., a bit more than 36
MBytes to keep in memory and about 48 million characters to display
when you use the default options(digits = 7)}.
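{The back-of-the-envelope arithmetic in R, for reference:

  n <- 3000
  n * (n - 1) / 2             # 4498500 ~ 4.5 million pairwise distances
  8 * n * (n - 1) / 2 / 1e6   # ~ 36 MB when each is an 8-byte double
}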
I don't think you can really make much of printing this many
numbers onto your console, as you try with
David> dist(xclara) -> xclara.dist
David> Works okay, though when attempting to display those results it freezes
David> up the entire system, probably as the result of memory
David> thrashing/starvation if the top results are any indicator:
David> 1661 R 8.5% 9:36.12 3 92 567 368M+ 3.88M 350M- 828M
"freezes up the entire system" when trying to print something
too large actually has something to do with user interface.
AFAIK, it doesn't work 'nicely' on any R console,
but at least in ESS on Linux, it's just that one Emacs,
displaying the "wrist watch" (and I can easily tell emacs "not
to wait" by pressing Ctrl g"). Also, just trying it now {on a
machine with large amounts of RAM}: After pressing return, it at
least starts printing (displaying to the *R* buffer) after a bit
more than 1 minute.. and that does ``seem'' to never finish.
I can signal a break (via the [Signals] Menu or C-c C-c in
Emacs), and still have to wait about 2-3 minutes for the output
stops; but it does, and I can work on.. {well, in theory; my Emacs
seems to have become v..e..r..y s...l...o....w} We only
recently had a variation on this theme in the ESS-help mailing
list, and several people were reporting they couldn't really
stop R from printing and had to kill the R process...
So after all, there's a not quite trivial problem "hidden" in
David's report: what should happen when a user accidentally
prints a huge object to the console, and how can we make sure R
can be told to stop?
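The accident is easily avoided once you know the size; something
like the following {just a sketch, reusing David's xclara.dist}
looks at the distances without printing all ~4.5 million of them:

  str(xclara.dist)                  # compact structure summary, no full printout
  summary(as.vector(xclara.dist))   # quantiles of the ~4.5 million distances
  as.vector(xclara.dist)[1:10]      # or peek at just the first few entries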
And as I see it now, there's even something like an R "bug" (or
"design infelicity") here:
I've now done it again {on a Dual-Opteron compute server with 4
GB RAM}: after stopping, via the ESS [Signals] [Break (C-c C-c)] menu,
Emacs stops immediately, but R doesn't return quickly;
rather, watching "top" {the good ol' unix process monitor}, I
see R using 99.9% CPU and its memory footprint ("VIRT" and
"SHR") increasing and increasing..., up to '1081m', a bit more
than 1 GB, before R finally returns (displays the prompt) after
a few minutes --- but then, as said, this is on a remote
64-bit machine with 4000 MB RAM.
BTW, when I then remove the 'dist' (and hclust) objects in R
and type gc() (or maybe do some other things in R), the R process
has about halved its apparent memory usage, to 500-something MB.
More stats:
  during printing:  798m
  after "break":    798m for ~5 seconds, then starting to
                    grow slowly (in my top, in steps of ~10m)
                    up to 1076m;
  then the R prompt is displayed and top shows "1081m".
It stays there until I do
  > gc()
where it goes down to VIRT 841m (RES 823m), and after removing the
large distance object and calling gc() again, it lowers to 820m
(RES 790m) and stays there.
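{In R code, the cleanup described above amounts to, using David's
object name:

  rm(xclara.dist)   # drop the large 'dist' object
  gc()              # garbage-collect and report current memory use
}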
Probably this thread should be moved to R-devel -- and hence I
cross-post for once.
Martin