[R] different randomForest performance for same data

Tue Dec 15 15:22:04 CET 2009

You need to be _extremely_ careful when assigning levels of factors.  Look at this example:

R> x1 = factor(c("a", "b", "c"))
R> x2 = factor(c("a", "c", "c"))
R> x3 = x2
R> levels(x3) <- levels(x1)
R> x3
[1] a b b
Levels: a b c
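
The assignment `levels(x3) <- levels(x1)` relabels the existing levels by position; it does not match them by name. Since x2 only has the levels "a" and "c", its second level "c" is renamed to "b". A minimal sketch of an alternative, assuming the goal is just to extend the level set while keeping every observation's value, is to rebuild the factor with `factor(x, levels = ...)`:

```r
x1 <- factor(c("a", "b", "c"))
x2 <- factor(c("a", "c", "c"))

# Rebuild the factor against the full level set.
# Values are matched by label, not by position.
x3 <- factor(x2, levels = levels(x1))

as.character(x3)   # values are unchanged: "a" "c" "c"
levels(x3)         # level set is extended: "a" "b" "c"
```

Applied to the data in the quoted message, this would presumably read `testdat$predictor1 <- factor(testdat$predictor1, levels = levels(traindat$predictor1))`. That assumes every level in the test data also occurs in the training data; any value that does not would become NA.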

I'll try to add more error-proofing to the code...
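
Until then, a user-side check along these lines might catch the problem before a model is fit. This is only a sketch: `same_labels` is a hypothetical helper written for this illustration, not a function from randomForest or base R.

```r
# Hypothetical helper: TRUE if re-levelling left every observation's label intact.
same_labels <- function(before, after) {
  identical(as.character(before), as.character(after))
}

x2 <- factor(c("a", "c", "c"))
all_levels <- c("a", "b", "c")

# Positional relabelling: silently turns "c" into "b".
bad <- x2
levels(bad) <- all_levels

# Rebuilding by label: keeps the values, extends the level set.
good <- factor(x2, levels = all_levels)

same_labels(x2, bad)    # FALSE
same_labels(x2, good)   # TRUE
```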

Andy

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Haring, Tim (LWF)
> Sent: Thursday, December 10, 2009 5:00 AM
> To: r-help at r-project.org
> Subject: [R] different randomForest performance for same data
> 
> Hello,
> 
> I came across a problem when building a randomForest model. 
> Maybe someone can help me.
> I have a training and a test dataset with a discrete response
> and ten predictors (numeric and factor variables). The two
> datasets match in number of predictors, variable names, and
> variable types (factor, numeric), except that one predictor
> has 20 levels in the training dataset and only 19 levels in
> the test dataset.
> I found that model performance differs between two runs: one
> that trains and tests a model on the unchanged datasets, and
> one made after assigning the levels of the training dataset to
> the test dataset. I only assign the levels and do not change
> the data themselves, yet the models perform differently.
> Why???
> 
> Here is my code:
> > library(randomForest)
> > load("datasets.RData")  # import traindat and testdat
> > nlevels(traindat$predictor1)
> [1] 20
> > nlevels(testdat$predictor1)
> [1] 19
> > nrow(traindat)
> [1] 9838
> > nrow(testdat)
> [1] 3841
> > set.seed(10)
> > rf_orig <- randomForest(x=traindat[,-1], y=traindat[,1],
> +     xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> > data.frame(rf_orig$test$err.rate)[100,1]      # Error on test dataset
> [1] 0.3082531
> 
> # assign the levels of the training dataset to the test dataset for predictor 1
> > levels(testdat$predictor1) <- levels(traindat$predictor1)  
> > nlevels(traindat$predictor1)
> [1] 20
> > nlevels(testdat$predictor1)
> [1] 20
> > nrow(traindat)
> [1] 9838
> > nrow(testdat)
> [1] 3841
> > set.seed(10)
> > rf_mod <- randomForest(x=traindat[,-1], y=traindat[,1],
> +     xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> > data.frame(rf_mod$test$err.rate)[100,1]       # Error on test dataset
> [1] 0.4808644  # is different
> 
> Cheers,
> TIM
> 