[R] Execution speed in randomForest

jim holtman jholtman at gmail.com
Fri Apr 6 18:45:59 CEST 2012


Are you measuring CPU time or elapsed time?  If it is elapsed time,
then also capture the CPU time to see whether the two differ.  Also
consider using Rprof() to see where the time is being spent.  What
else is running on the machine?  Is the system paging?  What type of
system are you running on?  Try some of the system-level profiling
tools; on Windows, use perfmon.
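
A rough sketch of the CPU-versus-elapsed comparison follows.  The
helpers time_runs() and profile_run(), the stand-in workload, and the
column name "label" in the comment are illustrative assumptions, not
anything from the randomForest package; substitute your own
randomForest() call for the workload.

```r
# Record user CPU, system CPU and elapsed time for repeated runs of `f`.
# `time_runs` is a hypothetical helper, not part of any package.
time_runs <- function(f, n = 8) {
  times <- t(sapply(seq_len(n), function(i) {
    gc()  # start each run from a similar memory state
    system.time(f())[c("user.self", "sys.self", "elapsed")]
  }))
  colnames(times) <- c("user", "system", "elapsed")
  times
}

# Profile a single run with Rprof() and return the per-function summary.
# Assumes R was built with profiling support and that the run is long
# enough for the sampler to collect at least a few samples.
profile_run <- function(f) {
  prof_file <- tempfile(fileext = ".out")
  Rprof(prof_file)
  f()
  Rprof(NULL)
  summaryRprof(prof_file)$by.self
}

# Stand-in workload; replace with something like
#   function() randomForest(label ~ ., data = selection, ntree = 500)
workload <- function() sort(runif(2e5))

times <- time_runs(workload, n = 5)
print(times)

# Elapsed time well above user + system suggests the process was
# waiting (paging, other processes, I/O) rather than computing.
print(times[, "elapsed"] - (times[, "user"] + times[, "system"]))

# profile_run(workload)  # run this on a slow case to see where time goes
```

If the slow runs show roughly the same CPU time as the fast ones but a
much larger elapsed time, the delay is probably outside R itself.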

On Fri, Apr 6, 2012 at 11:28 AM, Jason & Caroline Shaw
<los.shaws at gmail.com> wrote:
> I am using the randomForest package.  I have found that multiple runs
> of precisely the same command can generate drastically different run
> times.  Can anyone with knowledge of this package provide some insight
> as to why this would happen and whether there's anything I can do
> about it?  Here are some details of what I'm doing:
>
> - Data: ~80,000 rows, with 10 columns (one of which is the class label)
> - I randomly select 90% of the data to use to build 500 trees.
>
> And this is what I find:
>
> - Execution times of randomForest() using the entire dataset (in
> seconds): 20.65, 20.93, 20.79, 21.05, 21.00, 21.52, 21.22, 21.22
> - Execution times of randomForest() using the 90% selection: 17.78,
> 17.74, 126.52, 241.87, 17.56, 17.97, 182.05, 17.82 <-- Note the 3rd,
> 4th, and 7th.
> - When the speed is slow, it often stutters, with one or a few trees
> being produced very quickly, followed by a slow build taking 10 or 20
> seconds
> - The OOB results are indistinguishable between the fast and slow runs.
>
> I select the 90% of my data by using sample() to generate indices and
> then subsetting, like: selection <- data[sample,].  I thought perhaps
> this subsetting was being repeated on each use, rather than a new copy
> of the data being stored in memory, so I tried circumventing this with
> eval(data[sample,]).  Probably barking up the wrong tree -- it had no
> effect, and it doesn't explain the run-to-run variation (really, I'm
> just not clear on what eval() is for).  I have also tried garbage
> collecting with gc() between runs and adding a Sys.sleep() of 5
> seconds, but neither of these has helped.
>
> Any ideas?
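
On the subsetting question: as far as I know, an assignment like the
one quoted above is evaluated once, when it runs, so the subset is an
ordinary data frame in memory and eval() should not change anything.
Below is a minimal sketch of making the 90% selection reproducible, so
that a slow run can be repeated on exactly the same rows.  The toy
data frame, the seed, and the names idx and selection are assumptions
made for illustration only.

```r
set.seed(42)  # arbitrary seed, chosen only so the selection can be repeated

# Toy stand-in for the real data: 9 predictors plus a class label.
n <- 1000
data <- data.frame(matrix(rnorm(n * 9), ncol = 9),
                   label = factor(sample(c("a", "b"), n, replace = TRUE)))

# Draw 90% of the row indices without replacement.  The index vector is
# called `idx` rather than `sample` to avoid confusion with the function.
idx <- sample(nrow(data), size = floor(0.9 * nrow(data)))

# The subsetting happens here, once; `selection` is a separate data frame.
selection <- data[idx, ]

print(dim(selection))
```

Saving idx for a slow run and refitting on the same rows would show
whether the slowness follows the particular selection or varies from
run to run regardless of the data.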



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


