[R] Standard error for the area under a smoothed ROC curve?
Dan Bolser
dmb at mrc-dunn.cam.ac.uk
Wed Jan 12 19:13:18 CET 2005
On Wed, 12 Jan 2005, Frank E Harrell Jr wrote:
>Dan Bolser wrote:
>> Hello,
>>
>> I am making some use of ROC curve analysis.
>>
>> I find much help on the mailing list, and I have used the Area Under the
>> Curve (AUC) functions from the ROC function in the bioconductor project...
>>
>> http://www.bioconductor.org/repository/release1.5/package/Source/ROC_1.0.13.tar.gz
>>
>> However, I read here...
>>
>> http://www.medcalc.be/manual/mpage06-13b.php
>>
>> "The 95% confidence interval for the area can be used to test the
>> hypothesis that the theoretical area is 0.5. If the confidence interval
>> does not include the 0.5 value, then there is evidence that the laboratory
>> test does have an ability to distinguish between the two groups (Hanley &
>> McNeil, 1982; Zweig & Campbell, 1993)."
>>
>> But beyond its introduction, the above article is short on details. Can anyone
>> tell me how to calculate the CI of the AUC calculation?
>>
>>
>> I read this...
>>
>> http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf
>>
>> Which talks about resampling (by showing R code), but I can't understand
>> what is going on, or what is calculated (the example given is specific to
>> microarray analysis I think).
>>
>> I think a general AUC CI function would be a good addition to the ROC
>> package.
>>
>>
>>
>>
>> One more thing, in calculating the AUC I see the splines function is
>> recommended over the approx function. Here...
>>
>> http://tolstoy.newcastle.edu.au/R/help/04/10/6138.html
>>
>> How would I rewrite the following AUC functions (adapted from bioconductor
>> source) to use splines (or approxfun or splinefun) ...
>>
>>
>>>spe # Specificity
>>
>> [1] 0.02173913 0.13043478 0.21739130 0.32608696 0.43478261 0.54347826
>> [7] 0.65217391 0.76086957 0.89130435 1.00000000 1.00000000 1.00000000
>> [13] 1.00000000
>>
>>
>>>sen # Sensitivity
>>
>> [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9302326 0.8139535
>> [8] 0.6976744 0.5581395 0.4418605 0.3488372 0.2325581 0.1162791
>>
>> trapezint(1-spe,sen)
>> my.integrate(1-spe,sen)
>>
>> ## Functions
>> ## Nicked (and modified) from the ROC function in bioconductor.
>> "trapezint" <-
>> function (x, y, a = 0, b = 1)
>> {
>>     if (x[1] > x[length(x)]) {
>>         x <- rev(x)
>>         y <- rev(y)
>>     }
>>     y <- y[x >= a & x <= b]
>>     x <- x[x >= a & x <= b]
>>     if (length(unique(x)) < 2)
>>         return(NA)
>>     ya <- approx(x, y, a, ties = max, rule = 2)$y
>>     yb <- approx(x, y, b, ties = max, rule = 2)$y
>>     x <- c(a, x, b)
>>     y <- c(ya, y, yb)
>>     h <- diff(x)
>>     lx <- length(x)
>>     0.5 * sum(h * (y[-1] + y[-lx]))
>> }
>>
>> "my.integrate" <-
>> function (x, y, t0 = 1)
>> {
>>     f <- function(j) approx(x, y, j, rule = 2, ties = max)$y
>>     integrate(f, 0, t0)$value
>> }
>>
>>
>>
>>
>>
>> Thanks for any pointers,
>> Dan.
>
>I don't see why the above formulas are being used. The
>Bamber-Hanley-McNeil-Wilcoxon-Mann-Whitney nonparametric method works
>great. Just get the U statistic (concordance probability) used in
>Wilcoxon. As Somers' Dxy rank correlation coefficient is 2*(C-0.5) where
>C is the concordance or ROC area, the Hmisc package function rcorr.cens
>uses U statistic methods to get the standard error of Dxy. You can
>easily translate this to a standard error of C.

I am sure I could do this easily, except I can't.

The good thing about ROC is that I understand it (I can see it). I know
why the area means what it means, and I could even imagine how sampling
the data could give a CI on the area.

However, I don't know why "the area under the ROC curve is well known to
be equivalent to the numerator of the Mann-Whitney U statistic" - from
http://www.bioconductor.org/repository/devel/vignette/ROCnotes.pdf

Nor do I know how to calculate "the numerator of the Mann-Whitney U
statistic".

Can you point me at some ? pages or tutorials or even give an example of
what you suggested so I can try to follow it through?

I tried the following...

x <- rnorm(100,5,1) # REAL NEGATIVE
y <- rnorm(100,8,1) # REAL POSITIVE
t <- wilcox.test(x,y,paired=FALSE,conf.int=TRUE,conf.level=0.95)

> t

        Wilcoxon rank sum test with continuity correction

data:  x and y
W = 132, p-value < 2.2e-16
alternative hypothesis: true mu is not equal to 0
95 percent confidence interval:
 -3.232207 -2.664620
sample estimates:
difference in location
             -2.957496

And from ?wilcox.test ...

"if both x and y are given and paired is FALSE, a Wilcoxon rank sum test
(equivalent to the Mann-Whitney test) is carried out."

But I don't know what to do next. Sorry for all the questions, but I am a
dumb biologist.

Thanks for the help, Dan.
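
A possible next step, sketched in base R only. The W that wilcox.test
reports counts the (first sample, second sample) pairs in which the first
value is the larger one, with ties counted as 1/2. Dividing by the number of
pairs gives the concordance probability C, which is the area under the
empirical ROC curve. With W = 132 from wilcox.test(x, y) above and 100 x 100
pairs, that would be C = 1 - 132/10000 = 0.9868. The standard error below
uses the Hanley & McNeil (1982) approximation, which is an assumption of
this sketch and not the U-statistic standard error that rcorr.cens computes.
If you go through Dxy instead, Dxy = 2*(C - 0.5) implies SE(C) = SE(Dxy)/2.
The names n_neg, n_pos, auc_pairs, q1 and q2 are illustrative, and the seed
is arbitrary.

```r
# Simulated scores, as in the message: x = true negatives, y = true positives.
set.seed(1)
x <- rnorm(100, 5, 1)
y <- rnorm(100, 8, 1)
n_neg <- length(x)
n_pos <- length(y)

# With the positives as the first argument, W is the number of
# (positive, negative) pairs in which the positive scores higher
# (ties count 1/2).
W <- unname(wilcox.test(y, x)$statistic)

# Concordance probability C = area under the empirical ROC curve.
auc <- W / (n_pos * n_neg)

# The same number by brute force: compare every positive with every negative.
auc_pairs <- mean(outer(y, x, ">") + 0.5 * outer(y, x, "=="))

# Hanley & McNeil (1982) approximate standard error of the AUC.
q1 <- auc / (2 - auc)
q2 <- 2 * auc^2 / (1 + auc)
se <- sqrt((auc * (1 - auc) +
            (n_pos - 1) * (q1 - auc^2) +
            (n_neg - 1) * (q2 - auc^2)) / (n_pos * n_neg))

# Normal-approximation 95% confidence interval, truncated to [0, 1].
ci <- c(lower = max(0, auc - qnorm(0.975) * se),
        upper = min(1, auc + qnorm(0.975) * se))

print(c(auc = auc, se = se))
print(ci)
```

If the interval excludes 0.5, that matches the MedCalc reading quoted
earlier in the thread. When the AUC is very close to 1 the normal
approximation is rough, so resampling the two groups is a reasonable
cross-check.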
>
>Frank
>
>