[R] Concordance Index - interpretation
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Sat Dec 13 14:25:56 CET 2008
Gad Abraham wrote:
> K F Pearce wrote:
>> Hello everyone.
>>
>> This is a question regarding generation of the concordance index (c
>> index) in R using the function rcorr.cens. In particular about
>> interpretation of its direction and form of the 'predictor'.
>
> Since Frank Harrell hasn't replied I'll contribute my 2 cents.
>
>>
>> One of the arguments is a "numeric predictor variable" ( presumably
>> this is just a *single* predictor variable). Say this variable takes
>> numeric values.... Am I correct in thinking that if the c index is >
>> 0.5 (with Somers D positive) then this tells us that the higher the
>> numeric values of the 'predictor', the greater the survival probability
>> and similarly if the c index is <0.5 (with Somers D negative) then this
>> tells us that the higher the numeric values of the 'predictor' the
>> lower the survival probability ?
>
> The c-index is a generalisation of the area under the ROC curve (AUC),
> therefore it measures how well your model discriminates between
> different responses, i.e., is your predicted response low for low
> observed responses and high for high observed responses. So C > 0.5
> implies a good prediction ability, C = 0.5 implies no predictive ability
> (no better than random guessing), and C < 0.5 implies "good"
> anti-prediction (worse than random, but if you flip the prediction
> direction it becomes a good prediction).
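
To make the direction point above concrete, here is a minimal base-R sketch. It is not the rcorr.cens code; the helper name cindex_dir and the data are made up for illustration, and censoring is ignored. It assumes that C > 0.5 means larger predictor values go with larger observed responses, and shows that negating the predictor turns C into 1 - C, with Somers' D computed as 2 * (C - 0.5).

```r
# Hypothetical helper: pairwise concordance, no censoring.
cindex_dir <- function(x, y) {
  valid <- outer(y, y, ">")      # valid[i, j] is TRUE when y[i] > y[j]
  dx    <- outer(x, x, "-")      # dx[i, j] = x[i] - x[j]
  # concordant pairs score 1, pairs tied on x score 0.5
  s <- sum(valid & dx > 0) + 0.5 * sum(valid & dx == 0)
  s / sum(valid)
}

y <- c(2, 5, 7, 11, 13)          # made-up observed responses
x <- c(1, 3, 2, 5, 4)            # made-up predictor values

c_up     <- cindex_dir(x, y)     # 8 of 10 pairs concordant
c_down   <- cindex_dir(-x, y)    # same data, predictor direction flipped
somers_d <- 2 * (c_up - 0.5)     # rescales C from [0, 1] to [-1, 1]
print(c(C = c_up, C_flipped = c_down, Dxy = somers_d))
```

With these values C is 0.8, the flipped version is 0.2, and the sign of D changes with the direction of the predictor.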
>
>>
>> The c index estimates the "probability of concordance between predicted
>> and observed responses"....Harrell et al (1996) says "in predicting time
>> until death, concordance is calculated by considering all possible pairs
>> of patients, at least one of whom has died. If the *predicted* survival
>> time (probability) is larger for the patient who (actually) lived
>> longer, the predictions for that pair are said to be concordant with the
>> (actual) outcomes." I have read that "the c index is defined by the
>> proportion of all usable patients in which the predictions and outcomes
>> are concordant".
>>
>> Now, secondly, I'd like to ask what form the predictor can take.
>> Presumably if the predictor was a continuous-type variable e.g. 'age'
>> then predicted survival probability (calculated internally via Cox
>> regression?) would be compared with actual survival time for each
>> specific age to get the c index? Now, if the predictor was an *ordinal
>> categorical variable* where 1=worst group and 5=best group - I presume
>> that the c index would be calculated similarly but this time there would
>> be many ties in the predictor (as regards predicted survival
>> probability) - hence if I wanted to count all ties in such a case I
>> would keep the default argument outx=FALSE?
>
> Both the predictor and the actual response can be either continuous or
> categorical, as long as they are ordinal (since it's a rank-based method).
>
> I don't know about the outx part.
>
>>
>> Does anyone have a clear reference which gives the formula used to
>> generate the concordance index (with worked examples)?
>
> I think the explanation in Harrell 1996, Section 5.5 is pretty clear,
> but perhaps could've used some pseudocode. Anyway, I understand it as:
>
> 1) Create all pairs of observed responses.
> 2) For all valid response pairs, i.e., pairs where one response y_1 is
> greater than the other y_2, test whether the corresponding predictions
> are concordant, i.e., yhat_1 > yhat_2. If so add 1 to the running sum s.
> If yhat_1 = yhat_2, add 0.5 to the sum. Count the number n of valid
> response pairs.
> 3) Divide the total sum s by the number of valid response pairs n.
>
> Here's my simple attempt, unoptimised and doesn't handle censoring:
>
> # yhat: predicted response
> # y: observed response
> concordance <- function(yhat, y)
> {
>    s <- 0   # running sum of concordance scores
>    n <- 0   # number of valid response pairs
>    for(i in seq_along(y))
>    {
>       for(j in seq_along(y))
>       {
>          if(i != j)
>          {
>             if(y[i] > y[j])
>             {
>                s <- s + (yhat[i] > yhat[j]) + 0.5 * (yhat[i] == yhat[j])
>                n <- n + 1
>             }
>          }
>       }
>    }
>    s / n
> }
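
The function above does not handle censoring. One way to extend it is sketched below, under the usual assumption that a pair is usable only when the subject with the shorter follow-up time had an observed event. This is an illustration, not the Hmisc Fortran algorithm: the helper name concordance_cens and the data are hypothetical, pairs with tied times are simply skipped, and larger predictor values are assumed to mean longer survival.

```r
# Hypothetical helper, not the Hmisc implementation.
# x:     predictor, larger values assumed to mean longer survival
# time:  observed follow-up time
# event: 1 if the event (death) was observed, 0 if censored
concordance_cens <- function(x, time, event) {
  n <- length(time)
  shorter <- outer(time, time, "<")     # shorter[i, j]: time[i] < time[j]
  died_i  <- matrix(event == 1, n, n)   # died_i[i, j]: subject i had the event
  usable  <- shorter & died_i           # earlier time must be an observed event
  dx <- outer(x, x, "-")                # dx[i, j] = x[i] - x[j]
  # concordant when the subject who died earlier has the smaller predictor
  s <- sum(usable & dx < 0) + 0.5 * sum(usable & dx == 0)
  s / sum(usable)
}

time  <- c(1, 2, 3, 4, 5)   # made-up follow-up times
event <- c(1, 0, 1, 1, 0)   # made-up censoring indicators
x     <- c(1, 3, 2, 5, 4)   # made-up predictor values

c_cens <- concordance_cens(x, time, event)      # 6 of 7 usable pairs concordant
c_full <- concordance_cens(x, time, rep(1, 5))  # no censoring: 8 of 10 pairs
print(c(censored = c_cens, uncensored = c_full))
```

Censoring removes the pairs whose ordering cannot be determined, so the denominator drops from 10 to 7 pairs in this example.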
>
> See also Harrell's 2001 book "Regression Modeling Strategies", and for
> the special case of binary outcomes (which is the AUC), Hanley and
> McNeil (1982) "The Meaning and Use of the Area under a Receiver
> Operating Characteristic (ROC) Curve", Radiology 143:29--36.
>
> Cheers,
> Gad
>
>
Thanks for the great reply Gad.

outx=TRUE is used to not 'penalize' for ties on the predictions (or the
single variable given as x); this results in Goodman-Kruskal gamma-type
rank correlation indexes. When comparing different predictions with
different numbers of ties, it is especially not a good idea to discard
ties in x.

The Fortran code that comes with Hmisc can also be viewed to see the
exact algorithms.
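
The effect of ties in x described above can be illustrated with a small base-R sketch. It is not the Hmisc code: the helper name concordance_ties, its drop_x_ties argument, and the data are hypothetical, and censoring is ignored. With drop_x_ties = FALSE a pair tied on x scores 0.5, in the spirit of outx=FALSE; with drop_x_ties = TRUE such pairs are left out of both numerator and denominator, in the spirit of outx=TRUE.

```r
# Hypothetical helper: concordance with two ways of treating ties in x.
concordance_ties <- function(x, y, drop_x_ties = FALSE) {
  valid <- outer(y, y, ">")                   # valid[i, j]: y[i] > y[j]
  dx    <- outer(x, x, "-")                   # dx[i, j] = x[i] - x[j]
  if (drop_x_ties) valid <- valid & dx != 0   # discard pairs tied on x
  s <- sum(valid & dx > 0) + 0.5 * sum(valid & dx == 0)
  s / sum(valid)
}

y <- 1:6                      # made-up observed responses
x <- c(1, 2, 1, 2, 3, 3)      # made-up ordinal predictor with ties

c_keep <- concordance_ties(x, y)                      # (11 + 0.5 * 3) / 15
c_drop <- concordance_ties(x, y, drop_x_ties = TRUE)  # 11 / 12
gamma  <- (11 - 1) / (11 + 1)  # (concordant - discordant) / (concordant + discordant)
print(c(keep = c_keep, drop = c_drop, D_drop = 2 * (c_drop - 0.5), gamma = gamma))
```

Here 11 pairs are concordant, 1 is discordant and 3 are tied on x. Dropping the tied pairs raises the index, and 2 * (C - 0.5) then equals the Goodman-Kruskal gamma for these data, which is why a predictor with many ties can look better than it should when ties are discarded.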
Frank
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University