[R] Large dataset + randomForest
Florian Nigsch
fn211 at cam.ac.uk
Fri Jul 27 16:47:18 CEST 2007
I compiled the newest R version on a Red Hat Linux machine (uname -a =
Linux .cam.ac.uk 2.4.21-50.ELsmp #1 SMP Tue May 8 17:18:29 EDT 2007
i686 i686 i386 GNU/Linux) with 4GB of physical memory. The step where
the whole script crashes is inside the randomForest() routine; I know
that because I want to time it and therefore have it wrapped in a
system.time() call. The function exits with the error I posted earlier:
> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
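One thing that might lower the peak allocation (a sketch only, I have
not verified it on this data): the formula interface builds a full
model frame, which effectively copies the data, whereas the x/y
interface of randomForest() skips that step. Assuming V1 is the
response column of df:

    library(randomForest)
    ## x/y interface: avoids the model.frame copy made by V1 ~ .
    x <- df[trainindices, names(df) != "V1"]   # predictor columns only
    y <- df[trainindices, "V1"]                # response vector
    rf <- randomForest(x, y, do.trace = 5)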
When I call gc() directly before the randomForest() call and again
after it, I get this:
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells   255416   6.9     899071  24.1    16800.0    818163   21.9
Vcells 17874469 136.4   90854072 693.2     4000.1 269266598 2054.4
> rf <- randomForest(V1 ~ ., data=df, subset=trainindices, do.trace=5)
Error: cannot allocate vector of size 626.1 Mb
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
Ncells   255441   6.9     899071  24.1    16800.0    818163   21.9
Vcells 17874541 136.4  112037674 854.8     4000.1 269266598 2054.4
>
So the only real difference is in the "gc trigger" column and the
"(Mb)" column next to it. By the way, I am not running R in GUI mode.
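The "limit (Mb)" column suggests that heap maxima were set at startup
(via R's --max-nsize/--max-vsize options); the vector heap is capped
at roughly 4000 Mb here. If raising those limits is an option, a
sketch (the sizes below are placeholders, not recommendations):

    ## at the shell, before starting R:
    R --max-vsize=3500M --max-nsize=50M

    ## inside a running session, query the current limits:
    mem.limits()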
On 27 Jul 2007, at 13:17, jim holtman wrote:
> At the max, you had 2GB of memory being used. What operating system
> are you running, and how much physical memory do you have on your
> system? On Windows, there are command-line parameters for starting
> RGUI that let you define how much memory can be used; I am not sure
> about Linux/UNIX. So you are probably hitting the 2GB max, at which
> point you don't have any more physical memory available. If the
> computation is a long script, you might put some 'gc()' statements
> in the code to see which section is using the most memory.
>
> Your problem might have to be broken into parts to run.
>
> On 7/27/07, Florian Nigsch <fn211 at cam.ac.uk> wrote:
>> Hi Jim,
>>
>> Here is the output of gc() from the same R session (which I still
>> have running...)
>>
>>> gc()
>>            used  (Mb) gc trigger  (Mb) limit (Mb)  max used   (Mb)
>> Ncells   255416   6.9     899071  24.1    16800.0    818163   21.9
>> Vcells 17874469 136.4  113567591 866.5     4000.1 269266598 2054.4
>>
>> Would increasing the limit of Vcells and Ncells to 1GB (if that is
>> even possible?!) perhaps solve my problem?
>>
>> Cheers,
>>
>> Florian
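Following up on the suggestion above to break the problem into parts:
for randomForest specifically, one way might be to grow several
smaller forests and merge them afterwards with combine(). A sketch,
assuming the same df and trainindices as before:

    library(randomForest)
    ## grow the forest in chunks so each call has a smaller footprint,
    ## then merge the chunks into a single forest
    rf1 <- randomForest(V1 ~ ., data = df[trainindices, ], ntree = 100)
    rf2 <- randomForest(V1 ~ ., data = df[trainindices, ], ntree = 100)
    rf  <- combine(rf1, rf2)   # behaves like one 200-tree forest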