[R] Getting the C-index for a dataset that was not used to generate the logistic model

Fri Jul 17 09:42:44 CEST 2009

Professor Harrell,

Thanks for your lightning-fast reply. It was extremely helpful, and
pointed me exactly to where I needed to go to solve my problem.

For others reading, my problem was that I was incorrectly dealing with
the validation data.

I tried to do this to generate predictions from a pre-existing model
with new data:
lrm(formula = logit.lrm,data=validationData)

However, as Prof. Harrell kindly implied, I actually needed to use the
original model to predict the probabilities for the new data set:
logit.lrm.validationPredictions <-
predict(logit.lrm,newdata=validationData,type=\"fitted.ind\")

>From there, I could cobble together a dataframe of the actual results
in the new dataset with the predicted probabilities based on the
model, and regress from there. This allowed me to generate my
statistic of interest (the C-index).

Again, thank you,

Kyle

On Thu, Jul 16, 2009 at 9:18 PM, Frank E Harrell
Jr<f.harrell at vanderbilt.edu> wrote:
> Kyle Werner wrote:
>>
>> Does anyone know how to get the C-index from a logistic model - not using
>> the dataset that was used to train the model, but instead using a fresh
>> dataset on the same model?
>>
>> I have a dataset of 400 points that I've split into two halves, one for
>> training the logistic model, and the other for evaluating it. The
>> structure
>> is as follows:
>
> Kyle - I would not trust data splitting with N < 20,000.
>
>>
>> column headers are "got a loan" (dichotomous), "hourly income"
>> (continuous),
>> and "owns own home" (dichotomous)
>> The training data is
>> *trainingData[1,] = c(0,12,0)*
>> *...*
>> etc
>>
>> and the validation data is
>> *validationData[1,] = c(1,35,1)*
>> *...*
>> etc
>>
>> I use Prof. Harrell's excellent Design modules to perform a logistic
>> regression on the training data like so:
>> *logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)*
>> *lrm(formula = logit.lrm)$stats[6]*
>> (output is C 0.8739827 - i.e., just the C-index)
>> **
>> I really like the ability to extract the C-index (or ROC AUC), because
>> this
>> is a factor that I find very helpful in comparing various models. However,
>> I
>> don't really want to get that from the data that the model was built on.
>> Using that C-statistic would be cheating, in a sense, since I'm just
>> testing
>> the model on the data it was built against. I would rather get the
>> C-statistic by applying the model I just generated to the other half of
>> the
>> data that I saved.
>>
>> I have tried doing this:
>> *lrm(formula = logit.lrm,data=validationData)*
>> However, this actually generates a new model (giving different
>> coefficients
>> to the variables). It doesn't simply apply the new data to the model from
>> *
>> logit.lrm* that I generated before.
>
> If you are just fitting a new model with the only predictor being the
> predicted log odds, it is true you will get a new slope and intercept, but
> this will not affect the c-index.  So you can trust the output (for the
> c-index and other rank measures such as Dxy, tau, gamma).
>
> Or use rcorr.cens(predict(fit, newdata), newdata$y) and use Dxy=2*(C-.5).
>  You can use somers2( ) if you don't need the standard error.
>
> Frank
>
>>
>> So, can someone point me in the right direction for evaluating the model
>> that I built with trainingData, but getting the C-statistic against my
>> validationData?
>>
>> Thanks so much,
>>
>> Kyle Werner
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
> --
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                     Department of Biostatistics   Vanderbilt University
>