[R-sig-Geo] Getting averaged variable importance from bootstrap with the cubist models

Sun Apr 7 17:52:55 CEST 2019

Dear All,

I am using a Bootstrapping approach with the Cubist model. It is possible
to get the variable importance (percentages  in variable usage in the
models) after running the models. For each run of the model based on the
random sampling, the variable importance is
different. A robust estimate may be determined by taking the average of all
the percentages of usage for each specific variable involved in the models.
For this purpose, I have (1) saved each model to disk,  (2) read back in R
each model, (2) got the variable importance
 for each model, (3) got a final table with all the models and the
percentages allocated by each model for the variables. the final table I am
expecting should like the table below. It is possible to do  it manually by
calling in each model but the workload is too high when you have 100 number
of bootstraps.

Here below a reproducible example with dataset from the ithir package. I
have limited the bootstrap to 3 for quick run. Also below I have shown how
I could get the table focusing on one model at a time.

Is there anyway to have all the models called in a loop, and extract the
variable importance for each model and arrange them finally
to get the table instead of doing it for each model separately ? I am
actually running the models 100 times.

I will appreciate any help.

Expected table-------------------------------------------------------
Variables VImp cubist.type
MRVBF 62.5 Vimp.cubist1
AACN 72.5 Vimp.cubist1
NDVI 41 Vimp.cubist1
Mid_Slope_Positon 24.5 Vimp.cubist1
Landsat_Band1 24 Vimp.cubist1
Terrain_Ruggedness_Index 44.5 Vimp.cubist1
TWI 42 Vimp.cubist1
Hillshading 41.5 Vimp.cubist1
Slope 41.5 Vimp.cubist1
Light_insolation 16 Vimp.cubist1
Elevation 11.5 Vimp.cubist1
MRVBF 81.5 Vimp.cubist2
Elevation 48.5 Vimp.cubist2
NDVI 62.5 Vimp.cubist2
Mid_Slope_Positon 43 Vimp.cubist2
TWI 54.5 Vimp.cubist2
AACN 41.5 Vimp.cubist2
Landsat_Band1 38.5 Vimp.cubist2
Hillshading 37.5 Vimp.cubist2
Terrain_Ruggedness_Index 27 Vimp.cubist2
Slope 21 Vimp.cubist2
Light_insolation 20.5 Vimp.cubist2
MRVBF 78 Vimp.cubist3
Hillshading 62 Vimp.cubist3
TWI 57 Vimp.cubist3
Terrain_Ruggedness_Index 55.5 Vimp.cubist3
AACN 55 Vimp.cubist3
Light_insolation 38.5 Vimp.cubist3
NDVI 50.5 Vimp.cubist3
Slope 49 Vimp.cubist3
Landsat_Band1 47.5 Vimp.cubist3
Mid_Slope_Positon 45.5 Vimp.cubist3
Elevation 31.5 Vimp.cubist3

## REPRODUCIBLE EXAMPLE

rm(list = ls())

library(Cubist)
library(ithir) # install.packages("ithir", repos="
http://R-Forge.R-project.org <http://r-forge.r-project.org/>")
library(caret)

#Set a working directory
setwd("")

# Point data
data(HV_subsoilpH)

# subset data for modeling
set.seed(123)
training <- sample(nrow(HV_subsoilpH), 0.7 * nrow(HV_subsoilpH))
cDat <- HV_subsoilpH[training, ]
vDat <- HV_subsoilpH[-training, ]

#Create folder to store models
dir.create("models", recursive = TRUE)

# Number of bootstraps
nbag <- 3

# Fit cubist models for each bootstrap
for (i in 1:nbag) {
  trainingREP <- sample.int(nrow(cDat), 1.0 * nrow(cDat),replace = TRUE)
  fit_cubist <- cubist(x = cDat[trainingREP,
                                c("Terrain_Ruggedness_Index",
                                  "AACN", "Landsat_Band1", "Elevation",
"Hillshading",
                                  "Light_insolation", "Mid_Slope_Positon",
"MRVBF", "NDVI",
                                  "TWI", "Slope")],
                       y = cDat$pH60_100cm[trainingREP],
cubistControl(rules = 5,
                                       extrapolation = 5), committees = 3)

  modelFile <-paste(getwd(), "./models/",sep = "", "bootMod_", i,".rds")
  saveRDS(object = fit_cubist, file = modelFile)
}

# List all files in directory
c.models <- list.files(path = paste(getwd(),"./models",
                      sep = ""), pattern = "\\.rds$", full.names = TRUE)

#Reads the models
fit_cubist<-  readRDS(c.models[i])
varImp(fit_cubist) # But it only give variable importance for the final
model

#Alternatively, I can call the model one by one and apply varImp
#Call models one by one
fit_cubist1<-  readRDS("./models/bootMod_1.rds")
fit_cubist2<-  readRDS("./models/bootMod_2.rds")
fit_cubist3<-  readRDS("./models/bootMod_3.rds")

#Apply varImp on each model
VImp1<-varImp(fit_cubist1)
VImp2<-varImp(fit_cubist2)
VImp3<-varImp(fit_cubist3)

#Get a data frame for each variable importance
cub.VImp1<-as.data.frame(VImp1[1])
names(cub.VImp1)[1]<-"VImp"
cub.VImp2<-as.data.frame(VImp2[1])
names(cub.VImp2)[1]<-"VImp"
cub.VImp3<-as.data.frame(VImp3[1])
names(cub.VImp3)[1]<-"VImp"

#Create a column for each variable importance related to each model
cub.VImp1$cubist.type<-"Vimp.cubist1"
cub.VImp2$cubist.type<-"Vimp.cubist2"
cub.VImp3$cubist.type<-"Vimp.cubist3"

#Join the row names to each dataframe
cub.VImp1<-setNames(cbind(rownames(cub.VImp1),
cub.VImp1),c("Variables","VImp","cubist.model"))
cub.VImp2<-setNames(cbind(rownames(cub.VImp2),
cub.VImp2),c("Variables","VImp","cubist.model"))
cub.VImp3<-setNames(cbind(rownames(cub.VImp3),
cub.VImp3),c("Variables","VImp","cubist.model"))

#Bind all the models df together by rows
All.VarImp<-bind_rows(cub.VImp1,cub.VImp2,cub.VImp3)

#Get the average of variable importance for each variable
mean.VarImp <- ddply(All.VarImp, .(Variables), summarise,
               mean = mean(VImp),
               sd   = sd( VImp))
mean.VarImp

Ozias Hounkpatin

Post doc

Sveriges lantbruksuniversitet

Swedish University of Agricultural Sciences

Dept. of Soil and Environment

PO Box 1234, SE-123 45 Uppsala

Visiting address: Lennart Hjems våg 9

Phone: +46 18 67 12 51, Mobile: +46 72 207 85 62

ozias.hounkpatin using slu.se , www.slu.se

Hounkpatin Ozias <oziashounkpatin using gmail.com>
17:41 (10 minutes ago)
to R-sig-geo
ReplyForward
<https://drive.google.com/u/0/settings/storage?hl=en-GB>
<https://www.google.com/intl/en-GB/policies/terms/>
<https://www.google.com/intl/en-GB/policies/privacy/>
<https://www.google.com/gmail/about/policy/>

	[[alternative HTML version deleted]]