[R] ROC curve in R
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Thu Jul 26 19:45:46 CEST 2007
Dylan Beaudette wrote:
> On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
>> Note that even though the ROC curve as a whole is an interesting
>> 'statistic' (its area is a linear translation of the
>> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
>> statistics), each individual point on it is an improper scoring rule,
>> i.e., a rule that is optimized by fitting an inappropriate model. Using
>> curves to select cutoffs is a low-precision and arbitrary operation, and
>> the cutoffs do not replicate from study to study. Probably the worst
>> problem with drawing an ROC curve is that it tempts analysts to try to
>> find cutoffs where none really exist, and it makes analysts ignore the
>> whole field of decision theory.
>>
>> Frank Harrell
>
> Frank,
>
> This thread has caught my attention for a couple of reasons, possibly related to
> my novice-level experience.
>
> 1. In a logistic regression study, where I am predicting the probability of
> the response being 1 (for example), there exists a continuum of probability
> values and a finite number of {1,0} realities when I look either within the
> original data set or within a new 'verification' data set. I understand that
> drawing a line through the probabilities returned from the logistic
> regression is a loss of information, but there are times when a 'hard'
> decision requiring prediction of {1,0} is required. I have found that the
> ROCR package (not necessarily the ROC curve) can be useful in identifying the
> probability cutoff where accuracy is maximized. Is this an unreasonable way
> of using logistic regression as a predictor?
Logistic regression (with suitable attention to not assuming linearity
and to avoiding overfitting) is a great way to estimate P[Y=1]. Given
good predicted P[Y=1] and utilities (losses, costs) for incorrect
positive and negative decisions, an optimal decision is one that
maximizes expected utility. The ROC curve does not play a direct role
in this regard. If per-subject utilities are not available, the analyst
may make various assumptions about utilities (including the unreasonable
but often used assumption that utilities do not vary over subjects) to
find a cutoff on P[Y=1]. A very nice feature of P[Y=1] is that error
probabilities are self-contained. For example, if P[Y=1] = 0.02 for a
single subject and you predict Y=0, the probability of an error is 0.02
by definition. One doesn't need to compute an overall error probability
over the whole distribution of subjects' risks. If the cost of a false
negative is C, the expected cost is 0.02*C in this example.
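For illustration, a minimal sketch of that expected-cost logic in R; the
cost values and object names are hypothetical:

p.hat <- 0.02     # predicted P[Y=1] for one subject
C.fn  <- 100      # cost (lost utility) of a false negative
C.fp  <- 5        # cost (lost utility) of a false positive

# expected cost of each possible decision for this subject
cost.predict.0 <- p.hat       * C.fn   # call it Y=0, risk missing the event
cost.predict.1 <- (1 - p.hat) * C.fp   # call it Y=1, risk acting needlessly

# take the decision with the smaller expected cost; algebraically this
# reduces to the cutoff  P[Y=1] > C.fp / (C.fp + C.fn)
decision <- as.numeric(cost.predict.1 < cost.predict.0)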
>
> 2. The ROC curve can be a helpful way of communicating false positives / false
> negatives to other users who are less familiar with the output and
> interpretation of logistic regression.
What is more useful than that is a rigorous calibration curve estimate
to demonstrate the faithfulness of predicted P[Y=1] and a histogram
showing the distribution of predicted P[Y=1]. Models that put a lot of
predictions near 0 or 1 are the most discriminating. Calibration curves
and risk distributions are easier to explain than ROC curves. Too often
a statistician will solve for a cutoff on P[Y=1], imposing her own
utility function without querying any subjects.
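As a sketch, using the Design package (the data frame d and the model
formula below are hypothetical):

library(Design)                   # provides lrm, rcs, calibrate
fit <- lrm(y ~ rcs(age, 4) + sex, data = d, x = TRUE, y = TRUE)

# bootstrap overfitting-corrected calibration curve for predicted P[Y=1]
plot(calibrate(fit, B = 200))

# distribution of predicted risks; a discriminating model puts many
# predictions near 0 or 1
hist(predict(fit, type = "fitted"), nclass = 50,
     xlab = "Predicted P[Y=1]")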
>
>
> 3. I have been using the area under the ROC curve, Kendall's tau, and Cohen's
> kappa to evaluate the accuracy of a logistic regression-based prediction, the
> last two statistics based on some probability cutoff identified beforehand.
ROC area (equiv. to Wilcoxon-Mann-Whitney and Somers' Dxy rank
correlation between pred. P[Y=1] and Y) is a measure of pure
discrimination, not a measure of accuracy per se. Rank correlation
(concordance) measures do not require the use of cutoffs.
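In R, somers2 in the Hmisc package gives both at once; p.hat and y below
stand for hypothetical vectors of predicted probabilities and observed
0/1 outcomes:

library(Hmisc)
somers2(p.hat, y)   # returns C (the ROC area), Dxy = 2*(C - 0.5), n, Missing

No cutoff on p.hat enters the calculation anywhere.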
>
>
> How does the topic of decision theory relate to some of the circumstances
> described above? Is there a better way to do some of these things?
See above re: expected losses/utilities.
Good questions.
Frank
>
> Cheers,
>
> Dylan
>
>
>
>> gyadav at ccilindia.co.in wrote:
>>> http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
>>>
>>> There is a lot of help available; trying help.search("ROC curve") gave:
>>> Help files with alias or concept or title matching 'ROC curve' using
>>> fuzzy matching:
>>>
>>> granulo(ade4)             Granulometric Curves
>>> plot.roc(analogue)        Plot ROC curves and associated diagnostics
>>> roc(analogue)             ROC curve analysis
>>> colAUC(caTools)           Column-wise Area Under ROC Curve (AUC)
>>> DProc(DPpackage)          Semiparametric Bayesian ROC curve analysis
>>> cv.enet(elasticnet)       Computes K-fold cross-validated error curve for elastic net
>>> ROC(Epi)                  Function to compute and draw ROC-curves.
>>> lroc(epicalc)             ROC curve
>>> cv.lars(lars)             Computes K-fold cross-validated error curve for lars
>>> roc.demo(TeachingDemos)   Demonstrate ROC curves by interactively building one
>>>
>>> HTH
>>> See the help pages and examples; those will suffice.
>>>
>>> Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.
>>>
>>> Regards,
>>>
>>> Gaurav Yadav
>>> +++++++++++
>>> Assistant Manager, CCIL, Mumbai (India)
>>> Mob: +919821286118 Email: emailtogauravyadav at gmail.com
>>> Bhagavad Gita: Man is made by his Belief, as He believes, so He is
>>>
>>> "Rithesh M. Mohan" <rithesh.m at brickworkindia.com>
>>> Sent by: r-help-bounces at stat.math.ethz.ch
>>> 07/26/2007 11:26 AM
>>>
>>> To
>>> <R-help at stat.math.ethz.ch>
>>> cc
>>>
>>> Subject
>>> [R] ROC curve in R
>>>
>>> Hi,
>>>
>>> I need to build an ROC curve in R; can you please provide data steps / code
>>> or guide me through it?
>>>
>>> Thanks and Regards
>>>
>>> Rithesh M Mohan
>> -
>> Frank E Harrell Jr Professor and Chair School of Medicine
>> Department of Biostatistics Vanderbilt University
>>