[R] Execution speed in randomForest
Jason & Caroline Shaw
los.shaws at gmail.com
Fri Apr 6 19:20:09 CEST 2012
The CPU time and elapsed time are essentially identical. (That is, the
system time is negligible.)
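
For anyone wanting to make the same comparison, here is a minimal sketch using system.time(). The workload is a hypothetical stand-in (a small least-squares fit), not the actual randomForest() call, and the variable names are made up for illustration.

```r
# Stand-in workload; in the real case this would be the randomForest() call.
timing <- system.time({
  x <- matrix(rnorm(2e5), ncol = 10)
  y <- rnorm(nrow(x))
  fit <- lm.fit(x, y)
})

# user.self + sys.self is CPU time used by this R process;
# elapsed is wall-clock time.
cpu_seconds <- timing[["user.self"]] + timing[["sys.self"]]
wall_seconds <- timing[["elapsed"]]
print(c(cpu = cpu_seconds, system = timing[["sys.self"]], elapsed = wall_seconds))
```

If elapsed were much larger than CPU time, the process would be waiting on something (I/O, paging, other jobs); when the two agree, the time is really being spent computing.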
I just ran the code twice under Rprof. In the first run, while
randomForest is doing its thing, there are 850 consecutive lines which
read:
".C" "randomForest.default" "randomForest" "randomForest.formula" "randomForest"
In the second run, which took 285 seconds to complete, there are
14201 such lines, with nothing intervening.
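
A sketch of how sample counts like these can be collected and turned into approximate times. It assumes Rprof's default sampling interval of 0.02 seconds (the interval actually used is not stated above), and the busy loop is a hypothetical stand-in for the randomForest() call.

```r
prof_file <- tempfile(fileext = ".out")
interval <- 0.02  # Rprof's default sampling interval, in seconds (assumed)

Rprof(prof_file, interval = interval)
# Stand-in workload: keep the CPU busy for about half a second.
t0 <- proc.time()[["elapsed"]]
while (proc.time()[["elapsed"]] - t0 < 0.5) invisible(sqrt(runif(1e4)))
Rprof(NULL)

# Each line after the "sample.interval=..." header is one stack sample.
prof_lines <- readLines(prof_file)
n_samples <- sum(!grepl("sample.interval=", prof_lines, fixed = TRUE))
approx_seconds <- n_samples * interval

# Under the same assumption, the counts reported above convert to:
reported_seconds <- c(850, 14201) * interval  # 17.00 284.02
```

Under that assumption, 850 samples is about 17 seconds and 14201 samples is about 284 seconds, which lines up with the fast and slow run times. In other words, the whole difference is spent inside the .C call.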
There shouldn't be interference from elsewhere on the machine. This
is the only memory- and CPU-intensive process. I don't know how to
check what kind of paging is going on, but since the machine has 16GB
of memory and I am using maybe 3 or 4 at most, I hope paging is not an
issue.
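
One possible way to check for paging from within R on Linux is to compare the kernel's swap counters before and after a run. This is only a sketch: read_swap_counters() is a hypothetical helper, it assumes /proc/vmstat exists and exposes the pswpin and pswpout fields, and it returns NA where they are unavailable.

```r
# Hypothetical helper: read cumulative swap-in/swap-out page counts on Linux.
read_swap_counters <- function(path = "/proc/vmstat") {
  wanted <- c("pswpin", "pswpout")
  out <- setNames(rep(NA_real_, length(wanted)), wanted)
  if (!file.exists(path)) return(out)
  fields <- strsplit(readLines(path), " ", fixed = TRUE)
  keys <- vapply(fields, function(f) f[1], character(1))
  vals <- suppressWarnings(as.numeric(vapply(fields, function(f) f[2], character(1))))
  found <- intersect(wanted, keys)
  out[found] <- vals[match(found, keys)]
  out
}

before <- read_swap_counters()
# ... the randomForest() call would go here ...
after <- read_swap_counters()
swap_delta <- after - before  # pages swapped in/out system-wide during the run
print(swap_delta)
```

The counters are system-wide, so a nonzero difference only says that some swapping happened during the run, not that this R process caused it.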
I'm on a CentOS 5 box running R 2.15.0.
On Fri, Apr 6, 2012 at 12:45 PM, jim holtman <jholtman at gmail.com> wrote:
> Are you looking at the CPU or the elapsed time? If it is the elapsed
> time, then also capture the CPU time to see if it is different. Also
> consider the use of the Rprof function to see where time is being
> spent. What else is running on the machine? Are you doing any
> paging? What type of system are you running on? Use some of the
> system level profiling tools. If on Windows, then use perfmon.
>
> On Fri, Apr 6, 2012 at 11:28 AM, Jason & Caroline Shaw
> <los.shaws at gmail.com> wrote:
>> I am using the randomForest package. I have found that multiple runs
>> of precisely the same command can generate drastically different run
>> times. Can anyone with knowledge of this package provide some insight
>> as to why this would happen and whether there's anything I can do
>> about it? Here are some details of what I'm doing:
>>
>> - Data: ~80,000 rows, with 10 columns (one of which is the class label)
>> - I randomly select 90% of the data to use to build 500 trees.
>>
>> And this is what I find:
>>
>> - Execution times of randomForest() using the entire dataset (in
>> seconds): 20.65, 20.93, 20.79, 21.05, 21.00, 21.52, 21.22, 21.22
>> - Execution times of randomForest() using the 90% selection: 17.78,
>> 17.74, 126.52, 241.87, 17.56, 17.97, 182.05, 17.82 <-- Note the 3rd,
>> 4th, and 7th.
>> - When the speed is slow, it often stutters, with one or a few trees
>> being produced very quickly, followed by a slow build taking 10 or 20
>> seconds
>> - The oob results are indistinguishable between the fast and slow runs.
>>
>> I select the 90% of my data by using sample() to generate indices and
>> then subsetting, like: selection <- data[sample,]. I thought perhaps
>> this subsetting was getting repeated, rather than storing in memory a
>> new copy of all that data, so I tried circumventing this with
>> eval(data[sample,]). Probably barking up the wrong tree -- it had no
>> effect, and doesn't explain the run-to-run variation (really, I'm just
>> not clear on what eval() is for). I have also tried garbage
>> collecting with gc() between each run, and adding a Sys.sleep() for 5
>> seconds, but neither of these has helped either.
>>
>> Any ideas?
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.