[R] different randomForest performance for same data
Liaw, Andy
andy_liaw at merck.com
Tue Dec 15 15:22:04 CET 2009
You need to be _extremely_ careful when assigning levels of factors. Look at this example:
R> x1 = factor(c("a", "b", "c"))
R> x2 = factor(c("a", "c", "c"))
R> x3 = x2
R> levels(x3) <- levels(x1)
R> x3
[1] a b b
Levels: a b c
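The reason is that `levels<-` renames the existing levels by position; it does not match them by label. Here the second level of x2 ("c") is simply relabeled "b". A minimal sketch of one way around this, assuming the goal is only to give x2 the full level set of x1 without touching its values, is to rebuild the factor with factor(..., levels = ), which matches by label. The name x4 and the stopifnot() guard are illustrative additions, not part of randomForest:

```r
x1 <- factor(c("a", "b", "c"))
x2 <- factor(c("a", "c", "c"))

# Positional renaming: the 2nd level of x2 ("c") is relabeled "b".
x3 <- x2
levels(x3) <- levels(x1)

# Guard (illustrative): every level of x2 must already exist in x1,
# otherwise the re-leveling below would silently produce NAs.
stopifnot(all(levels(x2) %in% levels(x1)))

# Matching by label: values are kept, only the level set is extended.
x4 <- factor(x2, levels = levels(x1))

as.character(x3)  # values changed by the positional assignment
as.character(x4)  # values preserved
levels(x4)
```

The same pattern would apply to a data frame column, e.g. rebuilding the test set's factor from the training set's levels rather than assigning the levels directly.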
I'll try to add more XXXXproofing in the code...
Andy
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Haring, Tim (LWF)
> Sent: Thursday, December 10, 2009 5:00 AM
> To: r-help at r-project.org
> Subject: [R] different randomForest performance for same data
>
> Hello,
>
> I came across a problem when building a randomForest model.
> Maybe someone can help me.
> I have a training and a test dataset with a discrete response
> and ten predictors (numeric and factor variables). The two
> datasets agree in the number, names and data types (factor,
> numeric) of the predictors, except that one predictor has 20
> levels in the training dataset and only 19 levels in the test
> dataset.
> I found that model performance differs between training and
> testing with the unchanged datasets and doing so after
> assigning the levels of the training dataset to the test
> dataset. I only assign the levels and do not change the data
> itself, yet the models perform differently.
> Why?
>
> Here is my code:
> > library(randomForest)
> > load("datasets.RData") # import traindat and testdat
> > nlevels(traindat$predictor1)
> [1] 20
> > nlevels(testdat$predictor1)
> [1] 19
> > nrow(traindat)
> [1] 9838
> > nrow(testdat)
> [1] 3841
> > set.seed(10)
> > rf_orig <- randomForest(x=traindat[,-1], y=traindat[,1],
> +     xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> > data.frame(rf_orig$test$err.rate)[100,1] # Error on test dataset
> [1] 0.3082531
>
> # assign the levels of the training dataset to the test dataset for predictor 1
> > levels(testdat$predictor1) <- levels(traindat$predictor1)
> > nlevels(traindat$predictor1)
> [1] 20
> > nlevels(testdat$predictor1)
> [1] 20
> > nrow(traindat)
> [1] 9838
> > nrow(testdat)
> [1] 3841
> > set.seed(10)
> > rf_mod <- randomForest(x=traindat[,-1], y=traindat[,1],
> +     xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> > data.frame(rf_mod$test$err.rate)[100,1] # Error on test dataset
> [1] 0.4808644 # is different
>
> Cheers,
> TIM