[R] validation, calibration and Design

Frank E Harrell Jr f.harrell at vanderbilt.edu
Mon Jul 11 15:02:40 CEST 2005


Williams Scott wrote:
> Hi R experts,
> 
> I am trying to do a prognostic model validation study, using cancer
> survival data. There are 2 data sets - 1500 cases used to develop a
> nomogram, and another of 800 cases used as an independent validation
> cohort.  I have validated the nomogram in the original data (easy with
> the Design tools), and then want to show that it also has good results
> with the independent data using 60 month survival. I would also like to
> show that the nomogram is significantly different to an existing model
> based on 60 month survival data generated by it (eg by McNemar's test).

Scott,

A nomogram is a graphical device, not a model to validate.  It merely 
represents a model.

If the 800 subjects came from the same hospitals in roughly the same 
time era, you are doing an internal validation and this is an 
exceedingly inefficient way to do it.  Not only is this wasting 800 
subjects from developing the model, but the validation sample is not 
large enough to yield reliable accuracy estimates.  And I don't see how 
McNemar's test applies as nothing is binary about this problem.

If you have two models that have identical degrees of overfitting (e.g. 
were based on the same number of CANDIDATE degrees of freedom) you can 
use the rcorrp.cens function to test for differences in discrimination 
or paired predictions.
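
A minimal sketch of such a paired comparison with rcorrp.cens from Hmisc, run on simulated data. The cohort, the variable names (pred.new, pred.old) and the two sets of predicted 60-month survival probabilities are hypothetical stand-ins, not anything from the poster's data. Larger predictions are taken to mean longer expected survival.

```r
library(survival)
library(Hmisc)

set.seed(1)
n <- 800

# Simulated stand-in for a validation cohort (hypothetical data)
A <- rnorm(n)
B <- rnorm(n)
t.event <- rexp(n, rate = 0.01 * exp(0.8 * A + 0.4 * B))
t.cens  <- runif(n, 24, 120)
time <- pmin(t.event, t.cens)
cens <- as.integer(t.event <= t.cens)
S <- Surv(time, cens)

# Hypothetical predicted 60-month survival from two competing models:
# one using A and B, one using A only
pred.new <- exp(-0.01 * 60 * exp(0.8 * A + 0.4 * B))
pred.old <- exp(-0.01 * 60 * exp(0.8 * A))

# U-statistic comparison of the paired predictions
w <- rcorrp.cens(pred.new, pred.old, S)
print(w)

# Approximate normal test of whether one set of predictions is
# more concordant with observed survival than the other
z <- unname(w["Dxy"] / w["S.D."])
p <- 2 * pnorm(-abs(z))
cat("z =", round(z, 2), " two-sided p =", signif(p, 3), "\n")
```

The comparison is only fair under the condition stated above, that both models carry the same degree of overfitting.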

If you really have an external sample (say 800 subjects from another 
country) you can use the groupkm function in Design to get a 
Kaplan-Meier-based calibration curve.  Otherwise I would recombine the 
data, develop the model on all subjects you can get, and use the 
bootstrap to validate it.
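
Both routes can be sketched as follows, under stated assumptions. The data are simulated, the object names (sim, fit, pred60, cal.ext, fit.all, val) are hypothetical, and the code loads rms, the current successor to the Design package, because Design is no longer distributed; with Design installed the same function names would apply. Treat it as an illustration, not a definitive analysis.

```r
library(rms)   # successor to Design; assumed here so the sketch runs today

set.seed(2)

# Hypothetical generator for cohorts with predictors A, B and
# censored follow-up time in months
sim <- function(n) {
  A <- rnorm(n)
  B <- rnorm(n)
  t.event <- rexp(n, rate = 0.01 * exp(0.8 * A + 0.4 * B))
  t.cens  <- runif(n, 24, 120)
  data.frame(A = A, B = B,
             time = pmin(t.event, t.cens),
             cens = as.integer(t.event <= t.cens))
}
data1 <- sim(1500)   # development cohort
data2 <- sim(800)    # stand-in for a truly external cohort

# Route 1: external calibration at 60 months
fit <- cph(Surv(time, cens) ~ A + B, data = data1,
           x = TRUE, y = TRUE, surv = TRUE, time.inc = 60)

# Predicted 60-month survival, one value per external subject
pred60 <- as.vector(survest(fit, newdata = data2[, c("A", "B")],
                            times = 60, conf.int = FALSE,
                            se.fit = FALSE)$surv)

# Kaplan-Meier estimate at 60 months within groups of predicted survival
cal.ext <- groupkm(pred60, Surv(data2$time, data2$cens),
                   g = 8, u = 60, pl = FALSE)
print(cal.ext)   # column x = mean predicted, KM = observed

# Route 2: pool all subjects, refit, and validate with the bootstrap
all.data <- rbind(data1, data2)
fit.all <- cph(Surv(time, cens) ~ A + B, data = all.data,
               x = TRUE, y = TRUE, surv = TRUE, time.inc = 60)
val <- validate(fit.all, B = 50, dxy = TRUE)
print(val)       # index.corrected column holds optimism-corrected indexes
```

Plotting the KM column of cal.ext against its x column gives the calibration curve; points near the 45-degree line indicate good calibration.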

Frank

> 
> Hence, somewhat shortened:
> 
> #using R 2.01 on Windows
> library(Hmisc)
> library(Design)
> 
> data1 # data frame with predictor variables A and B, plus cens and
>       # time (months) columns
> ddist1 <- datadist(data1)
> options(datadist='ddist1')
> 
> s1 <- Surv(data1$time, data1$cens)
> 
> cph.nomo <- cph(s1 ~ A+B, data=data1, surv=TRUE, x=TRUE, y=TRUE,
>                 time.inc=60)
> 
> survcph <- Survival(cph.nomo)
> surv5 <- function(lp) survcph(60, lp)
> nomogram(cph.nomo, lp=TRUE, conf.int=FALSE, fun=list(surv5),
>          funlabel=c("5 yr DFS"))
> 
> # now have a useful nomogram model, with good discrimination and
> # calibration when checked with validate and calibrate (not shown)
> # ....move on to validation cohort of n=800
> 
> data2 # validation data with same variables A, B, cens, time
> # do I need to put data2 into datadist??
> 
> s2 <- Surv(data2$time, data2$cens)
> 
> # able to derive 60 month estimates of survival using
> # one row of predictors per validation subject
> data2.est5 <- survest(cph.nomo, data.frame(A=data2$A, B=data2$B),
>                       times=c(60), conf.int=0)
> 
> rcorr.cens(data2.est5$surv, s2) # tests discrimination of the model
> # against the observed censored data in the validation cohort
> 
> # I can't find a way to use calibrate in this setting though??
> # Also, if I have the 5 year estimates for 2 different models, I can
> #     use rcorr.cens to show discrimination, but which values are
> #     suitable for a test of difference (eg with McNemar's)?
> # I have tried predict / newdata function a number of ways but it
> #     typically returns an error relating to unequal vector lengths
> 
> What I can't work out is where to go now to derive a calibration curve of
> the predicted 5 year result (data2.est5) and the observed (s2). Or can I
> do it another way? For example, could I merge the 2 data frames and use
> lines 1:1500 to build the model and the last 800 lines to validate?
> 
> Obviously I am a novice, and sure to be missing something simple. I have
> spent countless hours poring over Prof Harrell's text (which is great
> but doesn't have a specific example of this) and Design Help plus the R
> news archive with no success, so any help is very much appreciated.
> 
> Scott Williams MD
> Peter MacCallum Cancer Centre
> Melbourne Australia
> 


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University



