[R] Getting the C-index for a dataset that was not used to generate the logistic model
Kyle Werner
kylewerner10 at gmail.com
Fri Jul 17 06:14:45 CEST 2009
Does anyone know how to get the C-index for a logistic model using a
fresh dataset, rather than the dataset that was used to train the
model?
I have a dataset of 400 points that I've split into two halves, one
for training the logistic model, and the other for evaluating it. The
structure is as follows:
The column headers are "got a loan" (dichotomous, gotALoan), "hourly
income" (continuous, hourlyIncome), and "owns own home" (dichotomous,
ownsHome).
The training data is
trainingData[1,] = c(0,12,0)
...
etc
and the validation data is
validationData[1,] = c(1,35,1)
...
etc
I use Prof. Harrell's excellent Design package to perform a logistic
regression on the training data like so:
logit.lrm <- lrm(gotALoan ~ hourlyIncome+ownsHome, data=trainingData)
lrm(formula = logit.lrm)$stats[6]
(output is C 0.8739827 - i.e., just the C-index)
I really like the ability to extract the C-index (or ROC AUC), because
this is a factor that I find very helpful in comparing various models.
However, I don't really want to get that from the data that the model
was built on. Using that C-statistic would be cheating, in a sense,
since I'm just testing the model on the data it was built against. I
would rather get the C-statistic by applying the model I just
generated to the other half of the data that I saved.
I have tried doing this:
lrm(formula = logit.lrm,data=validationData)
However, this actually fits a new model to validationData (giving
different coefficients to the variables). It doesn't simply apply the
new data to the logit.lrm model that I fit before.
So, can someone point me in the right direction for evaluating the
model that I built with trainingData, but getting the C-statistic
against my validationData?
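
To make the goal concrete, here is a minimal sketch of one possible
approach, written in base R only so that it runs without Design. It is
an illustration under assumptions, not a confirmed answer: the data
are simulated stand-ins (the real 400 rows are not shown here), glm()
is swapped in for lrm(), and cIndex() is a hypothetical helper that
computes the C-index from ranks via the Mann-Whitney identity. The
idea is to score the validation half with the coefficients from the
training fit, without refitting, and then measure concordance between
those scores and the observed outcomes.

```r
# Simulated stand-in data (an assumption; the real data are not shown).
set.seed(1)
n <- 400
hourlyIncome <- round(runif(n, 8, 60))
ownsHome     <- rbinom(n, 1, 0.4)
gotALoan     <- rbinom(n, 1, plogis(-4 + 0.1 * hourlyIncome + ownsHome))
allData      <- data.frame(gotALoan, hourlyIncome, ownsHome)

trainingData   <- allData[1:200, ]
validationData <- allData[201:400, ]

# Fit on the training half only. glm() is used here in place of lrm()
# so the sketch needs no add-on packages.
fit <- glm(gotALoan ~ hourlyIncome + ownsHome,
           family = binomial, data = trainingData)

# Score the validation half with the training coefficients.
# Nothing is refit; predict() returns the linear predictor by default.
validationScore <- predict(fit, newdata = validationData)

# Hypothetical helper: C-index = P(score of a random event exceeds the
# score of a random non-event), with ties counted as 1/2.
cIndex <- function(score, outcome) {
  r  <- rank(score)                 # midranks handle tied scores
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

cTrain <- cIndex(predict(fit), trainingData$gotALoan)       # apparent C
cValid <- cIndex(validationScore, validationData$gotALoan)  # held-out C
print(c(training = cTrain, validation = cValid))
```

With the lrm fit from above, the analogous step would presumably be
predict(logit.lrm, newdata = validationData) to get the scores,
followed by something like somers2() from Hmisc on those scores and
validationData$gotALoan; I have not verified that against my data.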
Thanks so much,
Kyle Werner
(Resending because I accidentally HTML formatted my original post so
it was scrubbed.)