[R] Large dataset + randomForest

Florian Nigsch fn211 at cam.ac.uk
Fri Jul 27 16:47:18 CEST 2007


I compiled the newest R version on Red Hat Linux (uname -a =
Linux .cam.ac.uk 2.4.21-50.ELsmp #1 SMP Tue May 8 17:18:29 EDT 2007
i686 i686 i386 GNU/Linux) with 4GB of physical memory. The step at
which the whole script crashes is within the randomForest() routine; I
know that because I want to time it and therefore have it inside a
system.time() call. The function exits with the error I posted earlier:

 > rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
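(For reference, the timed call looks roughly like this; a minimal
sketch using the df and trainindices objects from the session, with
'st' just a name for the timing result:)

 > st <- system.time(rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5))
 > st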

Calling gc() directly before and directly after the randomForest()
call, I get this:

 > gc()
            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells   255416   6.9     899071  24.1    16800.0    818163   21.9
Vcells 17874469 136.4   90854072 693.2     4000.1 269266598 2054.4
 > rf <- randomForest(V1 ~ ., data=df, subset=trainindices, do.trace=5)
Error: cannot allocate vector of size 626.1 Mb
 > gc()
            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells   255441   6.9     899071  24.1    16800.0    818163   21.9
Vcells 17874541 136.4  112037674 854.8     4000.1 269266598 2054.4
 >

So the only real difference is in the "gc trigger" column and the
"(Mb)" column next to it. By the way, I am not running R in GUI mode.
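
One thing I might try is the matrix interface of randomForest(), which
(as I understand it) avoids the extra copy that the formula interface
makes when building the model frame, possibly together with a smaller
ntree. A minimal sketch, assuming V1 is the response and the remaining
columns of df are the predictors:

 > x <- df[trainindices, setdiff(names(df), "V1")]   # predictor columns only
 > y <- df[trainindices, "V1"]                       # response vector
 > rf <- randomForest(x, y, ntree=100, do.trace=5)   # fewer trees, no model frame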


On 27 Jul 2007, at 13:17, jim holtman wrote:

> At the max, you had 2GB of memory being used.  What operating system
> are you running on, and how much physical memory do you have on your
> system?  For Windows, there are parameters on the command line used
> to start RGui that let you define how much memory can be used; I am
> not sure about Linux/UNIX.  So you are probably hitting the 2GB max,
> and then you don't have any more physical memory available.  If the
> computation is a long script, you might put some 'gc()' statements in
> the code to see what section is using the most memory.
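>
> For example, something along these lines (a hypothetical skeleton;
> step1() and step2() stand for sections of your own script):
>
>   print(gc())      # baseline before the heavy work
>   res1 <- step1()  # first section of the script
>   print(gc())      # memory state after section 1
>   res2 <- step2()  # next section
>   print(gc())      # memory state after section 2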
>
> Your problem might have to be broken into parts to run.
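>
> (The command-line parameters I mentioned for Windows are, if I
> remember correctly, along the lines of
>
>   Rgui --max-mem-size=3000M
>
> but check ?Memory for the details.)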
>
> On 7/27/07, Florian Nigsch <fn211 at cam.ac.uk> wrote:
>> Hi Jim,
>>
>> Here is the output of gc() of the same session of R (that I have
>> still running...)
>>
>>> gc()
>>            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
>> Ncells   255416   6.9     899071  24.1    16800.0    818163   21.9
>> Vcells 17874469 136.4  113567591 866.5     4000.1 269266598 2054.4
>>
>> Would increasing the limit of Vcells and Ncells to 1GB (if that is
>> possible?!) perhaps solve my problem?
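>>
>> I assume that would be done with the memory options when starting R,
>> something like (I am not sure these are the right flags or units):
>>
>>   R --max-vsize=1G --max-nsize=1G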
>>
>> Cheers,
>>
>> Florian


