[R] Interperting results of glmnet and coxph plot, Brier score and Harrel's C-Index - am I doing something wrong ???

Mon Sep 30 19:53:44 CEST 2013

On Sep 29, 2013, at 2:16 PM, E Joffe wrote:

> HI,
> 
> Thank you for your answer.
> There were 301 events out of 394 observations.
> 
> Study goals: Identify proteins with prognostic power in patients with AML.
> There were 232 proteins studied.
> Traditional models won't converge.
> I wanted to do a multivariate survival analysis that would allow me to
> identify prognostic proteins and get an estimation of their hazard ratio.
> LASSO allowed me to model the data but in order to get the HRs I needed to
> run it a second time throw coxph - this was done based on a discussion on
> stackexchange

http://r.789695.n4.nabble.com/estimating-survival-times-with-glmnet-and-coxph-td4614225.html

>  which I tweaked a bit.
> 
> 
> The features selected by LASSO are not necessarily all significant when
> constructing a coxph model in particular due to the lack of penalization.
> This was commented on by Dr. Hastie in one of his papers (or in the video
> lecture he gave - I need to find the exact source).
> I understand that by doing this procedure I might be reducing the accuracy
> of the model which is why I used a bootstrap evaluation.
> Interestingly, the coxph models following stepwise feature selection had
> marginally better performance in the evaluation compared with the LASSO
> models (maybe the LASSO over-fit the data a bit ? - don't know)
> 
> What I still don't understand is why with a C-Index of 0.76 and a Brier
> score of 0.07 when I look at the plot of the coxph object the confidence
> intervals are so far ?

The item you referenced was not on stackexchange but rather on Nabble. The archived copy in R-help is here:

[R] Estimating survival times with glmnet and coxph 

https://stat.ethz.ch/pipermail/r-help/2012-May/312029.html
https://stat.ethz.ch/pipermail/r-help/2012-May/312038.html

The response from Terry Therneau at that time explained why the CI's were wide. I think this implies that even without the additional (faulty) step of stepwise reduction in covariates you still would not be able to use the CI's coxphfrom such a fit.

-- 
David.
> Further, I am wondering whether there is a sanity check I could perform to
> make sure there is no error in the Brier/C-Index evaluation.
> 
> Thanks again for all you advice on this project,
> Erel
> 
> 
> 
> -----Original Message-----
> From: David Winsemius [mailto:dwinsemius at comcast.net] 
> Sent: Saturday, September 28, 2013 4:41 PM
> To: E Joffe
> Cc: r-help at r-project.org
> Subject: Re: [R] Interperting results of glmnet and coxph plot, Brier score
> and Harrel's C-Index - am I doing something wrong ???
> 
> 
> On Sep 28, 2013, at 2:39 AM, E Joffe wrote:
> 
>> Hi all,
>> 
>> I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset 
>> of
>> 394 obs. / 268 vars.
>> I use the following procedure:
>> 1.	Construct a coxnet on the entire dataset (by cv.glmnet)
>> 2.	Pick the significant features by selecting the non-zero coefficient
>> under the best lambda selected by the model
>> 3.	Build a coxph model with bi-directional stepwise feature selection
>> limited to the coxnet selected features.
> 
> I was a bit puzzled by the third step. Once you had a reduced model from
> glmnet, what was the statistical basis for further elimination of variables?
> 
> (Quite apart from the statistical issues, I was rather surprised that this
> procedure even produced results since the 'step' function is not described
> in the 'stats' package as applying to 'coxph' model objects.)
> 
>> 	
>> To validate the model I use both Brier score (library=peperr) and  
>> Harrel's [Harrell]
>> C-Index (library=Hmisc) with a bootstrap of 50 iterations.
>> 
>> 
>> MY QUESTION :  I am getting fairly good C-Index (0.76) and Brier  
>> (0.07)
>> values for the models however per the coxnet the %Dev explained by  
>> the model
>> is at best 0.27 and when I plot the survfit of the coxph the plotted
>> confidence interval is very large.
> 
> How many events did you have?  (The width of CI's is most importantly  
> dependent on event counts and not particularly improved by a high case  
> count. The power considerations are very similar to those of a  
> binomial test.)
> 
> 
>> What am I missing here ?
> 
> Perhaps sufficient events? (You also seem to be missing a description  
> of the study goals.)
> 
> 
> -- 
> David.
> 
>> 
>> %DEV=27%
>> 
>> 
>> 
>> Brier score - 0.07  ($ipec.coxglmnet -> [1] 7.24)
>> C-Index - 0.76 ($cIndex -> [1] 0.763)
>> 
>> 
>> 
>> DATA: [Private Health Information - can't publish] 394 obs./268 vars.
>> 
>> CODE (need to define a dataset with 'time' and 'status' variables):
>> 
>> library("survival")
>> library ("glmnet")
>> library ("c060")
>> library ("peperr")
>> library ("Hmisc")
>> 
>>   #creat Y (survival matrix) for glmnet
>>   surv_obj <- Surv(dataset$time,dataset$status)
>> 
>> 
>>   ## tranform categorical variables into binary variables with  
>> dummy for
>> dataset
>>   predict_matrix <- model.matrix(~ ., data=dataset,
>>                                  contrasts.arg = lapply
>> (dataset[,sapply(dataset, is.factor)], contrasts))
>> 
>>   ## remove the statu/time variables from the predictor matrix (x)  
>> for
>> glmnet
>>   predict_matrix <- subset (predict_matrix, select=c(-time,-status))
>> 
>>   ## create a glmnet cox object using lasso regularization and cross
>> validation
>>   glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox")
>> 
>>   ## get the glmnet model on the full dataset
>>   glmnet.obj <- glmnet.cv$glmnet.fit
>> 
>>   # find lambda index for the models with least partial likelihood
>> deviance (by cv.glmnet)
>>   optimal.lambda <- glmnet.cv$lambda.min    # For a more parsimoneous
>> model use lambda.1se
>>   lambda.index <- which(glmnet.obj$lambda==optimal.lambda)
>> 
>> 
>>   # take beta for optimal lambda
>>   optimal.beta  <- glmnet.obj$beta[,lambda.index]
>> 
>>   # find non zero beta coef
>>   nonzero.coef <- abs(optimal.beta)>0
>>   selectedBeta <- optimal.beta[nonzero.coef]
>> 
>>   # take only covariates for which beta is not zero
>>   selectedVar   <- predict_matrix[,nonzero.coef]
>> 
>>   # create a dataframe for trainSet with time, status and selected
>> variables in binary representation for evaluation in pec
>>   reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar))
>> 
>>   # glmnet.cox only with meaningful features selected by stepwise
>> bidirectional AIC feature selection
>>   glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~
>> .,data=reformat_dataSet),direction="both")
>> 
>> 
>> 
>> 
>> 
> ##--------------------------------------------------------------------------
>> -----------------------------
>>   ##                                    MODEL PERFORMANCE
>> 
>> 
> ##--------------------------------------------------------------------------
>> -----------------------------
>>   ##
>> 
>> 
>>   ## Calculate the Brier score - pec does its own bootstrap so this
>> function runs on i=51 (i.e., whole trainset)
>> 
>>       ## Brier score calculation to cox-glmnet
>>       peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox,
>>                               fit.fun=fit.coxph, load.all=TRUE,
>> 
>> indices=resample.indices(n=nrow(surv_obj),
>> method="boot", sample.n=50))
>> 
>>       # Get error predictions from peperr
>>       prederr.coxglmnet <- perr(peperr.coxglmnet)
>> 
>>       # Integrated prediction error Brier score calculation
>>       ipec.coxglmnet<-ipec(prederr.coxglmnet,
>> eval.times=peperr.coxglmnet$attribute, response=surv_obj)
>> 
>> 
>> ## C-Index calculation 50 iter bootstrapping
>> 
>> for (i in 1:50){
>>       print (paste("Iteration:",i))
>>       train <- sample(1:nrow(dataset), nrow(dataset), replace =  
>> TRUE) ##
>> random sampling with replacement
>>       # create a dataframe for trainSet with time, status and  
>> selected
>> variables in binary representation for evaluation in pec
>>       reformat_trainSet <- reformat_dataSet [train,]
>> 
>> 
>>       # glmnet.cox only with meaningful features selected by stepwise
>> bidirectional AIC feature selection
>>       glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~
>> .,data=reformat_dataSet),direction="both")
>> 
>>       selectedVarCox   <-
>> predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")]
>>       reformat_testSet <-  
>> as.data.frame(cbind(surv_obj,selectedVarCox))
>>       reformat_testSet <- reformat_dataSet [-train,]
>> 
>> 
>> #     compute c-index (Harrell's) for cox-glmnet models
>>       if (is.null(glmnet.cox.meaningful)){
>>         cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL)
>>       }else{
>>         cIndexCoxglmnet <- c(cIndexCoxglmnet,
>> 1-rcorr.cens(predict(glmnet.cox.meaningful,
>> reformat_testSet),Surv(reformat_testSet$time,reformat_testSet 
>> $status))[1])
>>       }
>> }
>> 
>> #Get average C-Index
>> cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE)
>> 
>> #create a list of all the objects generated
>> 
>> assign 
>> (name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li
>> st(glmnet.obj),
>> 
>> selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox),
>> 
>> glmnet 
>> .cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c
>> oxglmnet),
>>               cIndex=cIndex))
>> 
>> # save image of the workspace after each iteration
>>   save.image("final_subgroup_analysissubgroup_analysis.RData")
>> 
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius, MD
> Alameda, CA, USA
> 
> 

David Winsemius
Alameda, CA, USA