[R] Concordance Index - interpretation

Frank E Harrell Jr f.harrell at vanderbilt.edu
Sat Dec 13 14:25:56 CET 2008


Gad Abraham wrote:
> K F Pearce wrote:
>> Hello everyone.
>>  
>> This is a question regarding generation of the concordance index (c
>> index) in R using the function rcorr.cens.  In particular about
>> interpretation of its direction and form of the 'predictor'.
> 
> Since Frank Harrell hasn't replied I'll contribute my 2 cents.
> 
>>  
>> One of the arguments is a "numeric predictor variable" (presumably
>> this is just a *single* predictor variable).  Say this variable takes
>> numeric values.  Am I correct in thinking that if the c index is > 0.5
>> (with Somers' D positive) then this tells us that the higher the
>> numeric values of the 'predictor', the greater the survival probability,
>> and similarly if the c index is < 0.5 (with Somers' D negative) then this
>> tells us that the higher the numeric values of the 'predictor', the
>> lower the survival probability?
> 
> The c-index is a generalisation of the area under the ROC curve (AUC), 
> so it measures how well your model discriminates between different 
> responses, i.e., whether your predicted response is low for low observed 
> responses and high for high observed responses. C > 0.5 means the 
> predictions rank the responses better than chance (C = 1 is perfect 
> discrimination), C = 0.5 means no predictive ability (no better than 
> random guessing), and C < 0.5 means "good" anti-prediction (worse than 
> random, but flipping the prediction direction gives 1 - C > 0.5).
> 
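
On the direction question, here is a minimal base-R sketch, assuming the
simple uncensored pairwise definition with tied predictions counted as 0.5.
The function name cindex and the toy data are made up for illustration; this
is not the rcorr.cens code. It checks that reversing the predictor turns C
into 1 - C, so Somers' D, taken as 2 * (C - 0.5), only changes sign.

```r
# Sketch only: uncensored pairwise c-index, ties in x count as 0.5.
cindex <- function(x, y) {
  valid <- outer(y, y, ">")              # valid[i, j]: y[i] > y[j]
  conc  <- outer(x, x, ">") & valid      # predictor ordered the same way
  tied  <- outer(x, x, "==") & valid     # predictor tied within a valid pair
  (sum(conc) + 0.5 * sum(tied)) / sum(valid)
}

# Hypothetical data: y is a survival-like response, x an age-like predictor
# that tends to be lower when y is higher.
y <- c(5, 12, 3, 20, 8, 15)
x <- c(60, 45, 70, 45, 55, 50)

c_up   <- cindex(x, y)     # below 0.5: higher x goes with lower y
c_down <- cindex(-x, y)    # reversed predictor
dxy_up   <- 2 * (c_up - 0.5)
dxy_down <- 2 * (c_down - 0.5)
```

With these toy numbers c_up is below 0.5 and c_down is 1 - c_up, so the sign
of D carries the direction and its size carries the strength.
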
>>  
>> The c index estimates the "probability of concordance between predicted
>> and observed responses".  Harrell et al. (1996) says "in predicting time
>> until death, concordance is calculated by considering all possible pairs
>> of patients, at least one of whom has died.  If the *predicted* survival
>> time (probability) is larger for the patient who (actually) lived
>> longer, the predictions for that pair are said to be concordant with the
>> (actual) outcomes."  I have read that "the c index is defined by the
>> proportion of all usable patients in which the predictions and outcomes
>> are concordant".
>>  
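
A rough sketch of the "usable pairs" idea for right-censored data, assuming a
pair is usable only when the patient with the shorter follow-up time actually
died. The names c_index_censored, pred, time and event are hypothetical, pairs
with tied times are simply skipped, and this is not claimed to match the
Hmisc algorithm in every detail.

```r
# pred:  predicted survival time (larger = predicted to live longer)
# time:  observed follow-up time
# event: 1 if death was observed, 0 if censored
c_index_censored <- function(pred, time, event) {
  s <- 0   # concordance sum
  n <- 0   # number of usable pairs
  for (i in seq_along(time)) {
    if (event[i] != 1) next              # shorter time must be an observed death
    for (j in seq_along(time)) {
      if (time[i] < time[j]) {           # j is known to have outlived i
        s <- s + (pred[i] < pred[j]) + 0.5 * (pred[i] == pred[j])
        n <- n + 1
      }
    }
  }
  s / n
}

# Toy data: patients 2 and 5 are censored.
time  <- c(2, 4, 6, 8, 10)
event <- c(1, 0, 1, 1, 0)
pred  <- c(1, 3, 2, 5, 4)
cc <- c_index_censored(pred, time, event)
```

Here the pair (patient 2, patient 3) is not usable: patient 2 was censored at
time 4, so it is unknown whether they outlived patient 3, who died at time 6.
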
>> Now, secondly, I'd like to ask what form the predictor can take.
>> Presumably if the predictor was a continuous-type variable e.g. 'age'
>> then predicted survival probability (calculated internally via Cox
>> regression?) would be compared with actual survival time for each
>> specific age to get the c index?  Now, if the predictor was an *ordinal
>> categorical variable* where 1=worst group and 5=best group - I presume
>> that the c index would be calculated similarly but this time there would
>> be many ties in the predictor (as regards predicted survival
>> probability) - hence  if I wanted to count all ties in such a case I
>> would keep the default argument outx=FALSE? 
> 
> Both the predictor and the actual response can be either continuous or 
> categorical, as long as they are ordinal (since it's a rank-based method).
> 
> I don't know about the outx part.
> 
>>
>> Does anyone have a clear reference which gives the formula used to
>> generate the concordance index (with worked examples)? 
> 
> I think the explanation in Harrell 1996, Section 5.5 is pretty clear, 
> but perhaps could've used some pseudocode. Anyway, I understand it as:
> 
> 1) Create all pairs of observed responses.
> 2) For all valid response pairs, i.e., pairs where one response y_1 is 
> greater than the other y_2, test whether the corresponding predictions 
> are concordant, i.e., yhat_1 > yhat_2. If so, add 1 to the running sum s. 
> If yhat_1 = yhat_2, add 0.5 to the sum. Count the number n of valid 
> response pairs.
> 3) Divide the total sum s by the number of valid response pairs n.
> 
> Here's my simple attempt, unoptimised and doesn't handle censoring:
> 
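> # yhat: predicted response
> # y: observed response
> # Returns the proportion of valid pairs (y[i] > y[j]) whose predictions
> # are concordant; tied predictions count as 0.5.
> concordance <- function(yhat, y)
> {
>    stopifnot(length(yhat) == length(y))
>    s <- 0   # running concordance sum
>    n <- 0   # number of valid response pairs
>    for(i in seq_along(y))
>    {
>       for(j in seq_along(y))
>       {
>          # y[i] > y[j] already excludes i == j and tied responses
>          if(y[i] > y[j])
>          {
>             s <- s + (yhat[i] > yhat[j]) + 0.5 * (yhat[i] == yhat[j])
>             n <- n + 1
>          }
>       }
>    }
>    s / n   # NaN if there are no valid pairs
> }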
> 
> See also Harrell's 2001 book "Regression Modeling Strategies", and for 
> the special case of binary outcomes (which is the AUC), Hanley and 
> McNeil (1982) "The Meaning and Use of the Area under a Receiver 
> Operating Characteristic (ROC) Curve", Radiology 143:29--36.
> 
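
For the binary special case, a small base-R sketch comparing the pairwise
count with the rank-sum (Wilcoxon/Mann-Whitney) form of the AUC. The function
names auc_pairs and auc_rank and the data are invented for illustration, and
y is assumed to be coded 0/1.

```r
# Pairwise form: compare every positive with every negative,
# counting tied predictions as 0.5.
auc_pairs <- function(yhat, y) {
  pos <- yhat[y == 1]
  neg <- yhat[y == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Rank-sum form: rank() assigns midranks to tied predictions.
auc_rank <- function(yhat, y) {
  r  <- rank(yhat)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

y    <- c(0, 0, 1, 1, 0, 1)
yhat <- c(0.1, 0.4, 0.35, 0.8, 0.4, 0.4)
a_pairs <- auc_pairs(yhat, y)
a_rank  <- auc_rank(yhat, y)
```

Both forms give the same value on this example (6 of the 9 positive/negative
pairs, after counting the two ties as half each).
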
> Cheers,
> Gad
> 
> 

Thanks for the great reply Gad.

outx=TRUE drops pairs that are tied on the predictions (or on the single 
variable given as x), so that ties are not 'penalized'; this results in 
Goodman-Kruskal gamma-type rank correlation indexes.  When comparing 
predictions that have different numbers of ties, discarding ties in x is 
an especially bad idea.
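
To make the effect of ties concrete, here is a small base-R sketch rather
than a call to rcorr.cens itself. The function c_index_ties and its argument
drop_x_ties are hypothetical stand-ins for the two treatments of ties
described above, with no censoring and made-up data.

```r
# Sketch only: uncensored c-index with two ways of handling ties in x.
c_index_ties <- function(x, y, drop_x_ties = FALSE) {
  valid <- outer(y, y, ">")              # pairs with y[i] > y[j]
  conc  <- outer(x, x, ">") & valid
  tied  <- outer(x, x, "==") & valid
  if (drop_x_ties) {
    sum(conc) / (sum(valid) - sum(tied)) # pairs tied on x are discarded
  } else {
    (sum(conc) + 0.5 * sum(tied)) / sum(valid)  # ties count as 0.5
  }
}

y <- c(1, 2, 3, 4, 5, 6)
x <- c(1, 1, 1, 2, 2, 2)   # coarse two-level predictor, many ties

c_keep <- c_index_ties(x, y)                      # ties counted as 0.5
c_drop <- c_index_ties(x, y, drop_x_ties = TRUE)  # ties discarded
```

The coarse predictor scores 0.8 when ties count as 0.5 but a "perfect" 1 when
tied pairs are discarded, which is why discarding ties can flatter a
predictor with few distinct values.
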

The Fortran code that comes with Hmisc can also be viewed to see the 
exact algorithms.

Frank

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University


