[R] Execution speed in randomForest

Jason & Caroline Shaw los.shaws at gmail.com
Fri Apr 6 17:28:33 CEST 2012


I am using the randomForest package.  I have found that multiple runs
of precisely the same command can generate drastically different run
times.  Can anyone with knowledge of this package provide some insight
as to why this would happen and whether there's anything I can do
about it?  Here are some details of what I'm doing:

- Data: ~80,000 rows, with 10 columns (one of which is the class label)
- I randomly select 90% of the data to use to build 500 trees.

And this is what I find:

- Execution times of randomForest() using the entire dataset (in
seconds): 20.65, 20.93, 20.79, 21.05, 21.00, 21.52, 21.22, 21.22
- Execution times of randomForest() using the 90% selection: 17.78,
17.74, 126.52, 241.87, 17.56, 17.97, 182.05, 17.82 <-- Note the 3rd,
4th, and 7th.
- When a run is slow, it often stutters: one or a few trees are
produced very quickly, followed by a slow build taking 10 or 20
seconds.
- The OOB results are indistinguishable between the fast and slow runs.
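
For anyone wanting to reproduce timings like these, here is a minimal
sketch of one way they could be collected.  It is not the code that
produced the numbers above: time_runs() is a hypothetical helper, and
the toy data frame is synthetic and much smaller than the real data.

```r
# Hypothetical helper (not part of randomForest): call `fit_fun`
# `n_runs` times and return the elapsed wall-clock seconds of each call.
time_runs <- function(fit_fun, n_runs = 8) {
  vapply(seq_len(n_runs), function(i) {
    unname(system.time(fit_fun())["elapsed"])
  }, numeric(1))
}

# Demo on synthetic data, only if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  toy <- data.frame(matrix(rnorm(2000 * 9), ncol = 9),
                    label = factor(sample(c("a", "b"), 2000, replace = TRUE)))
  elapsed <- time_runs(
    function() randomForest::randomForest(label ~ ., data = toy, ntree = 50),
    n_runs = 3
  )
  print(round(elapsed, 2))
}
```

Passing do.trace = TRUE (or an integer) to randomForest() prints
progress as trees are built, which is one way to watch for the
stuttering described above.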

I select the 90% of my data by using sample() to generate row indices
and then subsetting, like: selection <- data[sample,].  I thought
perhaps this subsetting was being re-evaluated on every use, rather
than a new copy of the data being stored in memory once, so I tried
circumventing this with eval(data[sample,]).  I am probably barking up
the wrong tree: it had no effect, and it would not explain the
run-to-run variation anyway (really, I'm just not clear on what eval()
is for).  I have also tried garbage collecting with gc() between runs
and adding a 5-second Sys.sleep(), but neither of these has helped
either.
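
The subsetting step can be sketched as follows, using synthetic data
of roughly the shape described (80,000 rows, 9 predictors plus a class
label).  The names dat and idx are placeholders chosen for this
sketch, and the values are random, not the real data.

```r
set.seed(42)
n <- 80000

# Synthetic stand-in for the real data: 9 numeric columns + class label.
dat <- data.frame(matrix(rnorm(n * 9), ncol = 9),
                  label = factor(sample(c("a", "b"), n, replace = TRUE)))

# Draw 90% of the row indices without replacement.
idx <- sample(nrow(dat), size = round(0.9 * n))

# The right-hand side is evaluated once, here; `selection` is then an
# ordinary data frame holding its own copy of the chosen rows.
selection <- dat[idx, ]

dim(selection)
```

Because the assignment evaluates dat[idx, ] immediately, wrapping it
in eval() would not be expected to change anything, which is
consistent with the "no effect" observation above.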

Any ideas?


