[R] variable selection using residual difference
Hassan, Nazatulshima
Nazatulshima.Hassan at liverpool.ac.uk
Fri Mar 18 17:00:20 CET 2016
I have the following example dataset:
set.seed(2001)
n <- 100
Y <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
X1 <- sample(x=c(0,1,2), size=n, replace=TRUE, prob=c(0.1,0.4,0.5))
X2 <- sample(x=c(0,1,2), size=n, replace=TRUE, prob=c(0.5,0.25,0.25))
X3 <- c(0,2,2,2,2,2,2,2,0,2,0,2,2,0,0,0,0,0,2,0,0,2,2,0,0,2,2,2,0,2,0,2,0,2,1,2,1,1,1,1,1,1,1,1,1,1,1,0,1,2,2,2,2,2,2,2,2,2,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0)
dat <- data.frame(Y,X1,X2,X3)
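I fit a logistic regression model to each variable separately and rank the variables by their residual difference (res_dif, the null deviance minus the residual deviance), from highest to lowest. To keep the example short: the resulting rank is X3, X1, X2. Then I fit the two-variable models that keep the top-ranked variable X3 and add each remaining variable in turn, and again calculate res_dif: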
mod1 <- glm(Y~X3+X1, family=binomial, data=dat)
mod1$null.deviance-mod1$deviance
mod2 <- glm(Y~X3+X2, family=binomial,data=dat)
mod2$null.deviance-mod2$deviance
Again, I rank the models by res_dif (highest to lowest). Here I choose mod2. From there I fit the three-variable model as follows:
mod3 <- glm(Y~X3+X2+X1, family=binomial, data=dat)
mod3$null.deviance-mod3$deviance
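The first step described above (one model per variable, ranked by res_dif) might be sketched as follows. This is only an illustration: the helper name `res_dif` and the small data frame `toy` are made up here so that the snippet runs on its own, and `reformulate()` is used to build each formula from a variable name.

```r
# Hypothetical helper: null deviance minus residual deviance of a fitted glm
res_dif <- function(fit) fit$null.deviance - fit$deviance

# Made-up toy data (NOT the data from the post), only so the sketch is runnable
set.seed(1)
toy <- data.frame(
  Y  = rep(c(1, 0), each = 20),
  X1 = sample(0:2, 40, replace = TRUE),
  X2 = sample(0:2, 40, replace = TRUE),
  X3 = rep(c(2, 0, 0, 2), times = c(12, 8, 14, 6))
)

# Every column except the response is treated as a candidate predictor
predictors <- setdiff(names(toy), "Y")

# Fit one single-variable logistic model per predictor and record its res_dif
single <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "Y"), family = binomial, data = toy)
  res_dif(fit)
})

# Rank from highest to lowest res_dif
ranking <- sort(single, decreasing = TRUE)
print(ranking)
```

With the data from the post, `toy` would be replaced by `dat`.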
Basically, this continues until the model contains all the variables in the data.
My aim is to do variable selection based on res_dif instead of AIC, BIC or R2. Since my actual dataset has hundreds of variables, I wonder how I can apply this using a loop.
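One possible way to write the whole procedure as a loop is a greedy forward search: at each step, try adding every remaining variable to the current model, keep the one with the largest res_dif, and repeat. The sketch below assumes that reading of the procedure; the function name `forward_res_dif`, its `max_vars` argument, and the data frame `toy` are hypothetical and not part of any package.

```r
# Greedy forward selection on res_dif (null deviance minus residual deviance).
# 'data' is a data frame, 'response' the name of the 0/1 outcome column,
# 'max_vars' an optional (hypothetical) cap on the number of variables added.
forward_res_dif <- function(data, response, max_vars = NULL) {
  candidates <- setdiff(names(data), response)
  if (is.null(max_vars)) max_vars <- length(candidates)
  selected <- character(0)
  path <- numeric(0)
  while (length(selected) < max_vars && length(candidates) > 0) {
    # res_dif of the current model plus each remaining candidate
    scores <- sapply(candidates, function(v) {
      fit <- glm(reformulate(c(selected, v), response = response),
                 family = binomial, data = data)
      fit$null.deviance - fit$deviance
    })
    best <- names(which.max(scores))
    selected <- c(selected, best)
    path <- c(path, scores[[best]])
    candidates <- setdiff(candidates, best)
  }
  data.frame(step = seq_along(selected), added = selected, res_dif = path)
}

# Made-up toy data (NOT the data from the post), only so the sketch is runnable
set.seed(1)
toy <- data.frame(
  Y  = rep(c(1, 0), each = 20),
  X1 = sample(0:2, 40, replace = TRUE),
  X2 = sample(0:2, 40, replace = TRUE),
  X3 = rep(c(2, 0, 0, 2), times = c(12, 8, 14, 6))
)

result <- forward_res_dif(toy, "Y")
print(result)
```

With the data from the post the call would be `forward_res_dif(dat, "Y")`. Because each new model contains the previous one, res_dif can only stay the same or grow from step to step, so some stopping rule (here the `max_vars` cap) is needed if the goal is to select a subset rather than to order all the variables.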
Any suggestions would be appreciated.
Kind Regards
Shima