[R] Standard error for the area under a smoothed ROC curve?

Wed Jan 12 19:13:18 CET 2005

On Wed, 12 Jan 2005, Frank E Harrell Jr wrote:

>Dan Bolser wrote:
>> Hello, 
>> 
>> I am making some use of ROC curve analysis. 
>> 
>> I find much help on the mailing list, and I have used the Area Under the
>> Curve (AUC) functions from the ROC package in the Bioconductor project...
>> 
>> http://www.bioconductor.org/repository/release1.5/package/Source/
>> ROC_1.0.13.tar.gz 
>> 
>> However, I read here...
>> 
>> http://www.medcalc.be/manual/mpage06-13b.php
>> 
>> "The 95% confidence interval for the area can be used to test the
>> hypothesis that the theoretical area is 0.5. If the confidence interval
>> does not include the 0.5 value, then there is evidence that the laboratory
>> test does have an ability to distinguish between the two groups (Hanley &
>> McNeil, 1982; Zweig & Campbell, 1993)."
>> 
>> But beyond that early statement, the article is short on details. Can
>> anyone tell me how to calculate the CI of the AUC?
>> 
>> 
>> I read this...
>> 
>> http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf
>> 
>> Which talks about resampling (by showing R code), but I can't understand
>> what is going on, or what is calculated (the example given is specific to
>> microarray analysis I think).
>> 
>> I think a general AUC CI function would be a good addition to the ROC
>> package.
>> 
>> 
>> 
>> 
>> One more thing, in calculating the AUC I see the splines function is
>> recommended over the approx function. Here...
>> 
>> http://tolstoy.newcastle.edu.au/R/help/04/10/6138.html
>> 
>> How would I rewrite the following AUC functions (adapted from bioconductor
>> source) to use splines (or approxfun or splinefun) ...
>> 
>> 
>>>spe # Specificity
>> 
>>  [1] 0.02173913 0.13043478 0.21739130 0.32608696 0.43478261 0.54347826
>>  [7] 0.65217391 0.76086957 0.89130435 1.00000000 1.00000000 1.00000000
>> [13] 1.00000000
>> 
>> 
>>>sen # Sensitivity
>> 
>>  [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9302326 0.8139535
>>  [8] 0.6976744 0.5581395 0.4418605 0.3488372 0.2325581 0.1162791
>> 
>> trapezint(1-spe,sen)
>> my.integrate(1-spe,sen)
>> 
>> ## Functions
>> ## Nicked (and modified) from the ROC function in bioconductor.
>> "trapezint" <-
>> function (x, y, a = 0, b = 1)
>> {
>>     if (x[1] > x[length(x)]) {
>>       x <- rev(x)
>>       y <- rev(y)
>>     }
>>     y <- y[x >= a & x <= b]
>>     x <- x[x >= a & x <= b]
>>     if (length(unique(x)) < 2)
>>         return(NA)
>>     ya <- approx(x, y, a, ties = max, rule = 2)$y
>>     yb <- approx(x, y, b, ties = max, rule = 2)$y
>>     x <- c(a, x, b)
>>     y <- c(ya, y, yb)
>>     h <- diff(x)
>>     lx <- length(x)
>>     0.5 * sum(h * (y[-1] + y[-lx]))
>> }
>> 
>> "my.integrate" <-
>> function (x, y, t0 = 1)
>> {
>>     f <- function(j) approx(x,y,j,rule=2,ties=max)$y
>>     integrate(f, 0, t0)$value
>> }
>> 
>> 
>> 
>> 
>> 
>> Thanks for any pointers,
>> Dan.
>
>I don't see why the above formulas are being used.  The 
>Bamber-Hanley-McNeil-Wilcoxon-Mann-Whitney nonparametric method works 
>great.  Just get the U statistic (concordance probability) used in 
>Wilcoxon.  As Somers' Dxy rank correlation coefficient is 2*(C-0.5) where 
>C is the concordance or ROC area, the Hmisc package function rcorr.cens 
>uses U statistic methods to get the standard error of Dxy.  You can 
>easily translate this to a standard error of C.

I am sure I could do this easily, except I can't. 

The good thing about ROC is that I understand it (I can see it). I know
why the area means what it means, and I could even imagine how sampling
the data could give a CI on the area. 
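
To make the resampling idea concrete, here is a minimal sketch of a
percentile bootstrap CI for the AUC. The helper names (auc_pairs,
boot_auc_ci), the simulated scores, and the choice of B = 2000 resamples
are assumptions made for illustration; this is not the method used in
the ROC package.

```r
# AUC as the proportion of (negative, positive) pairs ranked correctly;
# tied pairs count one half.
auc_pairs <- function(neg, pos) {
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Percentile bootstrap: resample each group with replacement, recompute
# the AUC each time, and take the empirical quantiles.
boot_auc_ci <- function(neg, pos, B = 2000, level = 0.95) {
  stats <- replicate(B, {
    auc_pairs(sample(neg, replace = TRUE), sample(pos, replace = TRUE))
  })
  alpha <- 1 - level
  quantile(stats, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)                  # simulated data, for illustration only
neg <- rnorm(100, 5, 1)      # scores of true negatives
pos <- rnorm(100, 6, 1)      # scores of true positives

auc_hat <- auc_pairs(neg, pos)
ci <- boot_auc_ci(neg, pos)
auc_hat
ci
```

If the resulting interval excludes 0.5, that is the kind of evidence the
MedCalc page describes.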

However, I don't know why "the area under the ROC curve is well known to
be equivalent to the numerator of the Mann-Whitney U statistic" - from

http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf

Nor do I know how to calculate "the numerator of the Mann-Whitney U
statistic".
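
For what it is worth, the U statistic can be sketched as a plain count:
over all (positive, negative) pairs, how often does the positive score
higher (ties counting one half)? Dividing by the number of pairs gives
the AUC. The scores below are made up for illustration.

```r
neg <- c(1.2, 2.5, 3.1, 4.0)        # made-up scores, true negatives
pos <- c(2.8, 3.9, 4.4, 5.0, 6.3)   # made-up scores, true positives

# Direct count over all 5 * 4 = 20 pairs; ties would count 0.5.
U_pairs <- sum(outer(pos, neg, ">")) + 0.5 * sum(outer(pos, neg, "=="))

# Same number from ranks: rank sum of the positives minus its minimum
# possible value m * (m + 1) / 2.
r <- rank(c(pos, neg))
m <- length(pos)
U_ranks <- sum(r[seq_len(m)]) - m * (m + 1) / 2

auc <- U_pairs / (length(pos) * length(neg))

U_pairs   # 17
U_ranks   # 17
auc       # 0.85
```

wilcox.test(pos, neg)$statistic reports the same count (W = 17) for
these data.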

Can you point me at some ? pages or tutorials or even give an example of
what you suggested so I can try to follow it through?

I tried the following...

x <- rnorm(100,5,1)    # REAL NEGATIVE
#
y <- rnorm(100,8,1)    # REAL POSITIVE

t <- wilcox.test(x,y,paired=FALSE,conf.int=TRUE,conf.level=0.95)

> t

	Wilcoxon rank sum test with continuity correction

data:  x and y 
W = 132, p-value < 2.2e-16
alternative hypothesis: true mu is not equal to 0 
95 percent confidence interval:
 -3.232207 -2.664620 
sample estimates:
difference in location 
             -2.957496 

And from ?wilcox.test ...

"if both x and y are given and paired is FALSE, a Wilcoxon rank sum test
(equivalent to the Mann-Whitney test) is carried out."
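
A possible next step, sketched under assumptions: the W that
wilcox.test(x, y) reports counts the pairs in which x exceeds y, so with
x as the negatives and y as the positives the AUC is 1 - W/(n_neg *
n_pos). A standard error can then be approximated with the Hanley &
McNeil (1982) formula. The function name hanley_mcneil_se is
hypothetical, and the normal-theory interval at the end is only one of
several choices.

```r
# Hanley & McNeil (1982) approximation to the standard error of the AUC.
hanley_mcneil_se <- function(auc, n_pos, n_neg) {
  q1 <- auc / (2 - auc)
  q2 <- 2 * auc^2 / (1 + auc)
  sqrt((auc * (1 - auc) +
        (n_pos - 1) * (q1 - auc^2) +
        (n_neg - 1) * (q2 - auc^2)) / (n_pos * n_neg))
}

W     <- 132    # statistic from the wilcox.test output above
n_neg <- 100    # length(x), the real negatives
n_pos <- 100    # length(y), the real positives

auc <- 1 - W / (n_neg * n_pos)     # 0.9868
se  <- hanley_mcneil_se(auc, n_pos, n_neg)
ci  <- auc + c(-1, 1) * qnorm(0.975) * se   # approximate 95% interval
auc
se
ci
```

With an AUC this close to 1 the symmetric interval can overshoot 1, so
it may be sensible to truncate it at 1 or use a resampling interval
instead.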

But I don't know what to do next. Sorry for all the questions, but I am a
dumb biologist.

Thanks for the help, Dan.

>
>Frank
>
>