[R] Large dataset + randomForest
Kuhn, Max
Max.Kuhn at pfizer.com
Thu Jul 26 20:26:11 CEST 2007
Florian,
The first thing you should change is how you call randomForest.
Instead of specifying the model via a formula, use the
randomForest(x, y) interface.
When a formula is used, a terms object is created so that a model
matrix can be built for these and future observations. That terms
object can get big (with your data I think it would be a matrix of
size 151 x 150) and is essentially diagonal.
That might not solve it, but it should help.
Max
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Florian Nigsch
Sent: Thursday, July 26, 2007 2:07 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Large dataset + randomForest
[Please CC me in any replies as I am not currently subscribed to the
list. Thanks!]
Dear all,
I did a bit of searching on the question of large datasets but did
not come to a definite conclusion. What I am trying to do is the
following: I want to read in a dataset with approximately 100,000
rows and 150 columns. The file size is ~33 MB, which one would not
deem too big a file for R. To speed up reading the file, I do not
use read.table() but a loop that reads into a buffer with scan(),
does some preprocessing, and then appends the data to a data frame.
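Roughly, the reading loop looks like this (a simplified sketch:
"mydata.txt" stands in for the real file, and the preprocessing
step is omitted):

con <- file("mydata.txt", open = "r")
chunks <- list()
repeat {
    ## read up to 10000 rows of 150 numeric fields per pass
    buf <- scan(con, what = numeric(), nlines = 10000, quiet = TRUE)
    if (length(buf) == 0) break
    ## (preprocessing of the buffer would happen here)
    chunks[[length(chunks) + 1]] <- matrix(buf, ncol = 150, byrow = TRUE)
}
close(con)
df <- as.data.frame(do.call(rbind, chunks))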
When I then want to run randomForest(), R complains that it cannot
allocate a vector of size 313.0 MB. I am aware that randomForest
needs all data in memory, but
1) why should that suddenly be 10 times the size of the data (I
acknowledge the need for some internal data in R, but 10 times
seems a bit too much), and
2) there is still physical memory free on the machine (4 GB in
total; even though R is limited to 2 GB if I remember the help
pages correctly, 2 GB should still be enough). It does not seem to
work either with settings changed via mem.limits() or with the
run-time arguments --min-vsize and --max-vsize. What do these have
to be set to in my case?
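(For example, I tried starting R along these lines, with the values
picked more or less arbitrarily:

R --min-vsize=10M --max-vsize=2G --min-nsize=500K --max-nsize=10M

but the allocation error stayed the same.)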
> rf <- randomForest(V1 ~ ., data=df[trainindices,], do.trace=5)
Error: cannot allocate vector of size 313.0 Mb
> object.size(df)/1024/1024
[1] 129.5390
Any help would be greatly appreciated,
Florian
--
Florian Nigsch <fn211 at cam.ac.uk>
Unilever Centre for Molecular Sciences Informatics
Department of Chemistry
University of Cambridge
http://www-mitchell.ch.cam.ac.uk/
Telephone: +44 (0)1223 763 073