[R] randomForest gives different results for formula call v. x, y methods. Why?
Gavin Simpson
gavin.simpson at ucl.ac.uk
Sun Apr 29 15:38:39 CEST 2007
On Sat, 2007-04-28 at 21:13 -0400, David L. Van Brunt, Ph.D. wrote:
> Just out of curiosity, I took the default "iris" example in the RF
> helpfile...
> but seeing the admonition against using the formula interface for large data
> sets, I wanted to play around a bit to see how the various options affected
> the output. Found something interesting I couldn't find documentation for...
>
> Just like the example...
> > set.seed(12) # to be sure I have reproducibility
No differences between runs for me on FC4 using R 2.4.1 and 2.5.0 with:
> require(randomForest)
Loading required package: randomForest
randomForest 4.5-18
*if* I reset the seed before each call to randomForest.
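Your example code doesn't seem to reset the random seed before
each run. As such, each call to randomForest continues from wherever the
previous call left the random number generator, so it draws different
bootstrap samples and tries different candidate variables at each split.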
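A minimal base-R sketch of that point, with sample() standing in for the
bootstrap draw that randomForest makes internally. The names n, boot1,
boot2 and boot3 are only for illustration and are not part of your code:

```r
# Base R only: sample() stands in for one bootstrap draw of row indices.
n <- 150  # number of rows in iris

set.seed(12)
boot1 <- sample(n, replace = TRUE)  # first draw after seeding
boot2 <- sample(n, replace = TRUE)  # second draw; the RNG state has moved on

set.seed(12)
boot3 <- sample(n, replace = TRUE)  # first draw after re-seeding

# Re-seeding reproduces the first draw exactly
same_after_reset <- identical(boot1, boot3)

# Without re-seeding, the second draw is (almost certainly) different
same_without_reset <- identical(boot1, boot2)
```

So a single set.seed() at the top of a script makes the script as a whole
reproducible, but it does not make successive calls within it comparable.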
E.g., the runs are all the same when the seed is reset before each call:
> set.seed(12)
> randomForest(Species ~ ., data=iris)
Call:
randomForest(formula = Species ~ ., data = iris)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,1:4],y=iris[,5])
Call:
randomForest(x = iris[, 1:4], y = iris[, 5])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,c(1:4)],y=iris[,5])
Call:
randomForest(x = iris[, c(1:4)], y = iris[, 5])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
> set.seed(12)
> randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])
Call:
randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
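Rather than comparing the printed summaries by eye, the fitted objects can
be compared directly. This is only a sketch, assuming the randomForest
package is installed; rf_formula and rf_xy are names made up here, and
$confusion and $predicted are the components of the fitted object that hold
the OOB confusion matrix and the OOB predictions:

```r
library(randomForest)

# Formula interface, seed set immediately before the call
set.seed(12)
rf_formula <- randomForest(Species ~ ., data = iris)

# x, y interface, seed reset to the same value immediately before the call
set.seed(12)
rf_xy <- randomForest(x = iris[, 1:4], y = iris[, 5])

# Compare the OOB confusion matrices of the two fits
all.equal(rf_formula$confusion, rf_xy$confusion)

# Tabulate agreement of the per-case OOB predictions
table(rf_formula$predicted == rf_xy$predicted)
```

If the two interfaces really do behave the same once the seed is reset, the
first comparison should report TRUE and the table should contain no FALSE
entries.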
HTH
G
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson [t] +44 (0)20 7679 0522
ECRC [f] +44 (0)20 7679 0565
UCL Department of Geography
Pearson Building [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street
London, UK [w] http://www.ucl.ac.uk/~ucfagls/
WC1E 6BT [w] http://www.freshwaters.org.uk/
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%