[R] Predictor Importance in Random Forests and bootstrap

Tue Jan 28 01:09:33 CET 2014

I **think** this kind of methodological issue might be better suited to
Cross Validated (stats.stackexchange.com).  It's not really about R programming, which
is the main focus of this list. And yes, I know they do intersect.
Nevertheless...

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
H. Gilbert Welch

On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
> Hello!
> Below, I:
> 1. Create a data set with a bunch of factors. All of them are predictors
> and 'y' is the dependent variable.
> 2. Fit a classification Random Forest with predictor importance and look
> at 2 measures of importance: MeanDecreaseAccuracy and MeanDecreaseGini.
> 3. Run 2 bootstrap runs, one for each of the 2 Random Forests measures of
> importance mentioned above.
>
> Question: Could anyone please explain why I am getting such a huge positive
> bias across the board (for all predictors) for MeanDecreaseAccuracy?
>
> Thanks a lot!
> Dimitri
>
>
> #----------------------------------------------------------------
> # Creating a data set:
> #-------------------------------------------------------------
>
> N<-1000
> myset1<-c(1,2,3,4,5)
> probs1a<-c(.05,.10,.15,.40,.30)
> probs1b<-c(.05,.15,.10,.30,.40)
> probs1c<-c(.05,.05,.10,.15,.65)
> myset2<-c(1,2,3,4,5,6,7)
> probs2a<-c(.02,.03,.10,.15,.20,.30,.20)
> probs2b<-c(.02,.03,.10,.15,.20,.20,.30)
> probs2c<-c(.02,.03,.10,.10,.10,.25,.40)
> myset.y<-c(1,2)
> probs.y<-c(.65,.30)
>
> set.seed(1)
> y<-as.factor(sample(myset.y,N,replace=TRUE,probs.y))
> set.seed(2)
> a<-as.factor(sample(myset1, N, replace = TRUE,probs1a))
> set.seed(3)
> b<-as.factor(sample(myset1, N, replace = TRUE,probs1b))
> set.seed(4)
> c<-as.factor(sample(myset1, N, replace = TRUE,probs1c))
> set.seed(5)
> d<-as.factor(sample(myset2, N, replace = TRUE,probs2a))
> set.seed(6)
> e<-as.factor(sample(myset2, N, replace = TRUE,probs2b))
> set.seed(7)
> f<-as.factor(sample(myset2, N, replace = TRUE,probs2c))
>
> mydata<-data.frame(a,b,c,d,e,f,y)
>
>
> #-------------------------------------------------------------
> # Single Random Forests run with predictor importance.
> #-------------------------------------------------------------
>
> library(randomForest)
> set.seed(123)
> rf1<-randomForest(y~.,data=mydata,importance=TRUE)
> importance(rf1)[,c(3:4)]
>
> #-------------------------------------------------------------
> # Bootstrapping run
> #-------------------------------------------------------------
>
> library(boot)
>
> ### Defining two functions to be used for bootstrapping:
>
> # myrf3 returns MeanDecreaseAccuracy:
> myrf3<-function(usedata,idx){
>   set.seed(123)
>   out<-randomForest(y~.,data=usedata[idx,],importance=TRUE)
>   return(importance(out)[,3])
> }
>
> # myrf4 returns MeanDecreaseGini:
> myrf4<-function(usedata,idx){
>   set.seed(123)
>   out<-randomForest(y~.,data=usedata[idx,],importance=TRUE)
>   return(importance(out)[,4])
> }
>
> ### 2 bootstrap runs:
> rfboot3<-boot(mydata,myrf3,R=10)
> rfboot4<-boot(mydata,myrf4,R=10)
>
> ### Results
> rfboot3   # for MeanDecreaseAccuracy
> colMeans(rfboot3$t)-importance(rf1)[,3]
>
> rfboot4   # for MeanDecreaseGini
> colMeans(rfboot4$t)-importance(rf1)[,4]   # for MeanDecreaseGini
>
> --
> Dimitri Liakhovitski
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
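
One possible contributor to the positive bias asked about above, offered as an assumption rather than a diagnosis confirmed anywhere in this thread: boot() resamples rows with replacement, so each resampled data set contains duplicated rows. When randomForest then draws its own bootstrap sample for each tree, an out-of-bag row can have an identical copy in-bag, which could make out-of-bag accuracy look too good and so inflate MeanDecreaseAccuracy for every predictor. The base R sketch below only counts how often that overlap happens. It does not fit a forest, and all variable names are made up for the illustration.

```r
# Hypothetical illustration (not from the original post): how often does an
# out-of-bag (OOB) row of one tree have an identical copy among that tree's
# in-bag rows, when the forest is grown on a bootstrap resample?
set.seed(42)
N <- 1000

# Outer bootstrap, drawn the way boot() draws it: row ids with replacement.
outer_idx <- sample.int(N, N, replace = TRUE)

# Inner bootstrap, standing in for the per-tree sample a forest would draw
# from the already resampled data (positions 1..N of that data).
inner_pos <- sample.int(N, N, replace = TRUE)
inbag_pos <- unique(inner_pos)
oob_pos   <- setdiff(seq_len(N), inbag_pos)

# Map positions in the resampled data back to original row ids.
inbag_rows <- outer_idx[inbag_pos]
oob_rows   <- outer_idx[oob_pos]

# Share of distinct original rows that made it into the outer resample.
unique_fraction <- length(unique(outer_idx)) / N

# Share of OOB rows whose original row id also appears in-bag.
leak_fraction <- mean(oob_rows %in% inbag_rows)

print(unique_fraction)
print(leak_fraction)
```

Run on the original data instead of a resample (outer_idx <- seq_len(N)), every row id is distinct and the same calculation gives an overlap of zero, which is consistent with the single run rf1 not showing the inflation. Whether this fully explains the size of the bias would still need to be checked against actual randomForest fits.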