[R] problem with BootCV for coxph in pec after feature selection with glmnet (lasso)
E Joffe
ejoffe at hotmail.com
Sat Jul 6 02:16:25 CEST 2013
Hi,
I am attempting to evaluate the prediction error of a coxph model that was
built after feature selection with glmnet.
In the preprocessing stage I used na.omit (dataset) to remove NAs.
I reconstructed all my factor variables into binary variables with dummies
(using model.matrix)
I then used glmnet lasso to fit a cox model and select the best performing
features.
Then I fit a coxph model using only these feature.
When I try to evaluate the model using pec and a bootstrap I get an error
that the prediction matrix has wrong dimensions.
Suddenly the cox object has 318 variables instead of 356 variables in the
dataset.
I don't know why this is happening.
The cox object I assign to pec and the dataframe are both of the same size.
However, once pec refits the model its size changes (356 -> 318 variables).
Apparently something is happening during the bootstrap sampling that removes
some variables.
As mentioned, I used na.omit in the preprocessing so there should not be any
NAs.
Here are some details from my workspace:
Reformat_dataSet: 368 obs. Of 356 variables
print (glmnet.cox) ----> 354 df, p=1 n= 368, number of events= 288 (354 df
= 354 variables + time and status => 356 variables)
Here is the pec function and the error:
pec.f <- as.formula(Hist(time,status) ~ 1)
brierGlmnet <- pec(list(glmnet.cox), data = reformat_Dataset,
splitMethod="BootCV", B=50, formula = pec.f)
> Error in predictSurvProb.coxph(object = structure(list(coefficients =
structure(c(-4.27787223119601, :
Prediction matrix has wrong dimensions:
368 rows and 318 columns.
But requested are predicted probabilities for
118 subjects (rows) in newdata and 356 time points (columns)
This may happen when some covariate values are missing in newdata!?
Here are the relevant sections of the code:
trainSet <- na.omit (dataset)
#creat Y (survival matrix) for glmnet
surv_obj <- Surv(trainSet$time,trainSet$status)
## tranform categorical variables into binary variables with dummy for
trainSet
predict_matrix <- model.matrix(~ ., data=trainSet,
contrasts.arg = lapply
(trainSet[,sapply(trainSet, is.factor)], contrasts, contrasts=FALSE))
## remove the statu/time variables from the predictor matrix (x) for glmnet
predict_matrix <- subset (predict_matrix, select=c(-time,-status))
## create a glmnet cox object using lasso regularization
glmnet.obj <- glmnet (predict_matrix, surv_obj, family="cox")
# find lambda for which dev.ratio is max
max.dev.index <- which.max(glmnet.obj$dev.ratio)
optimal.lambda <- glmnet.obj$lambda[max.dev.index]
# take beta for optimal lambda
optimal.beta <- glmnet.obj$beta[,max.dev.index]
# find non zero beta coef
nonzero.coef <- abs(optimal.beta)>0
selectedBeta <- optimal.beta[nonzero.coef]
# take only covariates for which beta is not zero
selectedVar <- predict_matrix[,nonzero.coef]
# create a dataframe for trainSet with time, status and selected variables
in binary representation for evaluation in pec
reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
# create coxph object with pre-defined coefficients
glmnet.cox <- coxph(surv_obj ~ selectedVar, init=selectedBeta,iter=0)
## Brier score for the cox-glmnet model
brierGlmnet <- pec(list(glmnet.cox), data = reformat_Dataset,
splitMethod="BootCV", B=50,
formula = pec.f)
Thank you !!!
More information about the R-help
mailing list