[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???
E Joffe
ejoffe at hotmail.com
Sun Sep 29 23:16:52 CEST 2013
HI,
Thank you for your answer.
There were 301 events out of 394 observations.
Study goals: Identify proteins with prognostic power in patients with AML.
There were 232 proteins studied.
Traditional models won't converge.
I wanted to do a multivariate survival analysis that would allow me to
identify prognostic proteins and get an estimation of their hazard ratio.
LASSO allowed me to model the data but in order to get the HRs I needed to
run it a second time throw coxph - this was done based on a discussion on
stackexchange
http://r.789695.n4.nabble.com/estimating-survival-times-with-glmnet-and-coxp
h-td4614225.html which I tweaked a bit.
The features selected by LASSO are not necessarily all significant when
constructing a coxph model in particular due to the lack of penalization.
This was commented on by Dr. Hastie in one of his papers (or in the video
lecture he gave - I need to find the exact source).
I understand that by doing this procedure I might be reducing the accuracy
of the model which is why I used a bootstrap evaluation.
Interestingly, the coxph models following stepwise feature selection had
marginally better performance in the evaluation compared with the LASSO
models (maybe the LASSO over-fit the data a bit ? - don't know)
What I still don't understand is why with a C-Index of 0.76 and a Brier
score of 0.07 when I look at the plot of the coxph object the confidence
intervals are so far ?
Further, I am wondering whether there is a sanity check I could perform to
make sure there is no error in the Brier/C-Index evaluation.
Thanks again for all you advice on this project,
Erel
-----Original Message-----
From: David Winsemius [mailto:dwinsemius at comcast.net]
Sent: Saturday, September 28, 2013 4:41 PM
To: E Joffe
Cc: r-help at r-project.org
Subject: Re: [R] Interperting results of glmnet and coxph plot, Brier score
and Harrel's C-Index - am I doing something wrong ???
On Sep 28, 2013, at 2:39 AM, E Joffe wrote:
> Hi all,
>
> I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset
> of
> 394 obs. / 268 vars.
> I use the following procedure:
> 1. Construct a coxnet on the entire dataset (by cv.glmnet)
> 2. Pick the significant features by selecting the non-zero coefficient
> under the best lambda selected by the model
> 3. Build a coxph model with bi-directional stepwise feature selection
> limited to the coxnet selected features.
I was a bit puzzled by the third step. Once you had a reduced model from
glmnet, what was the statistical basis for further elimination of variables?
(Quite apart from the statistical issues, I was rather surprised that this
procedure even produced results since the 'step' function is not described
in the 'stats' package as applying to 'coxph' model objects.)
>
> To validate the model I use both Brier score (library=peperr) and
> Harrel's [Harrell]
> C-Index (library=Hmisc) with a bootstrap of 50 iterations.
>
>
> MY QUESTION : I am getting fairly good C-Index (0.76) and Brier
> (0.07)
> values for the models however per the coxnet the %Dev explained by
> the model
> is at best 0.27 and when I plot the survfit of the coxph the plotted
> confidence interval is very large.
How many events did you have? (The width of CI's is most importantly
dependent on event counts and not particularly improved by a high case
count. The power considerations are very similar to those of a
binomial test.)
> What am I missing here ?
Perhaps sufficient events? (You also seem to be missing a description
of the study goals.)
--
David.
>
> %DEV=27%
>
>
>
> Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24)
> C-Index - 0.76 ($cIndex -> [1] 0.763)
>
>
>
> DATA: [Private Health Information - can't publish] 394 obs./268 vars.
>
> CODE (need to define a dataset with 'time' and 'status' variables):
>
> library("survival")
> library ("glmnet")
> library ("c060")
> library ("peperr")
> library ("Hmisc")
>
> #creat Y (survival matrix) for glmnet
> surv_obj <- Surv(dataset$time,dataset$status)
>
>
> ## tranform categorical variables into binary variables with
> dummy for
> dataset
> predict_matrix <- model.matrix(~ ., data=dataset,
> contrasts.arg = lapply
> (dataset[,sapply(dataset, is.factor)], contrasts))
>
> ## remove the statu/time variables from the predictor matrix (x)
> for
> glmnet
> predict_matrix <- subset (predict_matrix, select=c(-time,-status))
>
> ## create a glmnet cox object using lasso regularization and cross
> validation
> glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox")
>
> ## get the glmnet model on the full dataset
> glmnet.obj <- glmnet.cv$glmnet.fit
>
> # find lambda index for the models with least partial likelihood
> deviance (by cv.glmnet)
> optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous
> model use lambda.1se
> lambda.index <- which(glmnet.obj$lambda==optimal.lambda)
>
>
> # take beta for optimal lambda
> optimal.beta <- glmnet.obj$beta[,lambda.index]
>
> # find non zero beta coef
> nonzero.coef <- abs(optimal.beta)>0
> selectedBeta <- optimal.beta[nonzero.coef]
>
> # take only covariates for which beta is not zero
> selectedVar <- predict_matrix[,nonzero.coef]
>
> # create a dataframe for trainSet with time, status and selected
> variables in binary representation for evaluation in pec
> reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
>
> # glmnet.cox only with meaningful features selected by stepwise
> bidirectional AIC feature selection
> glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~
> .,data=reformat_dataSet),direction="both")
>
>
>
>
>
##--------------------------------------------------------------------------
> -----------------------------
> ## MODEL PERFORMANCE
>
>
##--------------------------------------------------------------------------
> -----------------------------
> ##
>
>
> ## Calculate the Brier score - pec does its own bootstrap so this
> function runs on i=51 (i.e., whole trainset)
>
> ## Brier score calculation to cox-glmnet
> peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox,
> fit.fun=fit.coxph, load.all=TRUE,
>
> indices=resample.indices(n=nrow(surv_obj),
> method="boot", sample.n=50))
>
> # Get error predictions from peperr
> prederr.coxglmnet <- perr(peperr.coxglmnet)
>
> # Integrated prediction error Brier score calculation
> ipec.coxglmnet<-ipec(prederr.coxglmnet,
> eval.times=peperr.coxglmnet$attribute, response=surv_obj)
>
>
> ## C-Index calculation 50 iter bootstrapping
>
> for (i in 1:50){
> print (paste("Iteration:",i))
> train <- sample(1:nrow(dataset), nrow(dataset), replace =
> TRUE) ##
> random sampling with replacement
> # create a dataframe for trainSet with time, status and
> selected
> variables in binary representation for evaluation in pec
> reformat_trainSet <- reformat_dataSet [train,]
>
>
> # glmnet.cox only with meaningful features selected by stepwise
> bidirectional AIC feature selection
> glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~
> .,data=reformat_dataSet),direction="both")
>
> selectedVarCox <-
> predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")]
> reformat_testSet <-
> as.data.frame(cbind(surv_obj,selectedVarCox))
> reformat_testSet <- reformat_dataSet [-train,]
>
>
> # compute c-index (Harrell's) for cox-glmnet models
> if (is.null(glmnet.cox.meaningful)){
> cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL)
> }else{
> cIndexCoxglmnet <- c(cIndexCoxglmnet,
> 1-rcorr.cens(predict(glmnet.cox.meaningful,
> reformat_testSet),Surv(reformat_testSet$time,reformat_testSet
> $status))[1])
> }
> }
>
> #Get average C-Index
> cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE)
>
> #create a list of all the objects generated
>
> assign
> (name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li
> st(glmnet.obj),
>
> selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox),
>
> glmnet
> .cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c
> oxglmnet),
> cIndex=cIndex))
>
> # save image of the workspace after each iteration
> save.image("final_subgroup_analysissubgroup_analysis.RData")
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Alameda, CA, USA
More information about the R-help
mailing list