[R] randomForest memory footprint

John Foreman john.4man at gmail.com
Wed Sep 7 20:45:59 CEST 2011


Hello, I am attempting to train a random forest model using the
randomForest package on roughly 575,000 rows and 8 columns (7
predictors, 1 response). The data set is the first block of the UCI
Machine Learning Repository dataset "Record Linkage Comparison
Patterns", with the slight modification that I dropped two columns
containing many NAs and filled the remaining gaps with kNN imputation.
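
In case the details matter, the preprocessing was along these lines.
This is a sketch from memory rather than my exact script: the file
name, the use of VIM::kNN for the imputation, and k = 5 are
approximations.

library(VIM)   # provides kNN() for k-nearest-neighbour imputation

## block_1.csv is the first block of the UCI "Record Linkage
## Comparison Patterns" data; missing values are coded as "?".
mydata1 <- read.csv("block_1.csv", na.strings = "?")

## Drop the two NA-heavy comparison columns, impute what is left,
## and make sure the response is a factor for classification.
mydata1 <- mydata1[, setdiff(names(mydata1),
                             c("cmp_fname_c2", "cmp_lname_c2"))]
mydata1.clean <- kNN(mydata1, k = 5, imp_var = FALSE)
mydata1.clean$is_match <- as.factor(mydata1.clean$is_match)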

When I load the dataset, R uses no more than 100 MB of RAM. I am
running 64-bit R with about 4 GB of RAM available. When I call
randomForest(), however, the allocation fails. Example:

> summary(mydata1.clean[,3:10])
  cmp_fname_c1     cmp_lname_c1       cmp_sex           cmp_bd
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.2857   1st Qu.:0.1000   1st Qu.:1.0000   1st Qu.:0.0000
 Median :1.0000   Median :0.1818   Median :1.0000   Median :0.0000
 Mean   :0.7127   Mean   :0.3156   Mean   :0.9551   Mean   :0.2247
 3rd Qu.:1.0000   3rd Qu.:0.4286   3rd Qu.:1.0000   3rd Qu.:0.0000
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
     cmp_bm           cmp_by          cmp_plz          is_match
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   FALSE:572820
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   TRUE :  2093
 Median :0.0000   Median :0.0000   Median :0.00000
 Mean   :0.4886   Mean   :0.2226   Mean   :0.00549
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000
 Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
> mydata1.rf.model2 <- randomForest(x = mydata1.clean[,3:9],y=mydata1.clean[,10],ntree=100)
Error: cannot allocate vector of size 877.2 Mb
In addition: Warning messages:
1: In dim(data) <- dim :
  Reached total allocation of 3992Mb: see help(memory.size)
2: In dim(data) <- dim :
  Reached total allocation of 3992Mb: see help(memory.size)
3: In dim(data) <- dim :
  Reached total allocation of 3992Mb: see help(memory.size)
4: In dim(data) <- dim :
  Reached total allocation of 3992Mb: see help(memory.size)
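
For reference, this is roughly how I measured the ~100 MB figure
quoted above (a sketch; memory.size() is Windows-only, consistent
with the warnings):

format(object.size(mydata1.clean), units = "Mb")  # the data frame itself
memory.size(max = TRUE)  # peak Mb allocated by this R session so far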

Other techniques, such as boosted trees, handle this data size just
fine. Are there any parameters I can adjust so that randomForest()
can handle the full data with ntree = 100 or more?
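
For concreteness, the workaround I have in mind looks like the sketch
below; sampsize and nodesize come from ?randomForest and combine() is
the package's function for merging forests, but the specific values
are guesses on my part:

library(randomForest)

## Grow the forest in small batches so that less scratch space is
## needed at any one time, then merge the batches with combine().
batches <- lapply(1:10, function(i)
    randomForest(x = mydata1.clean[, 3:9],
                 y = mydata1.clean[, 10],
                 ntree    = 10,      # 10 trees x 10 batches = 100 trees
                 sampsize = 100000,  # bootstrap sample smaller than nrow(x)
                 nodesize = 5))      # larger terminal nodes, smaller trees
mydata1.rf.model2 <- do.call(combine, batches)

My understanding is that the out-of-bag estimates will not carry over
to the merged object, which I can live with; what I do not know is
whether this is enough to stay under 4 GB.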

Thanks,
John


