[R] ROC curve in R

Frank E Harrell Jr f.harrell at vanderbilt.edu
Sat Aug 4 15:29:36 CEST 2007


Dylan Beaudette wrote:
> On Thursday 26 July 2007 10:45, Frank E Harrell Jr wrote:
>> Dylan Beaudette wrote:
>>> On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
>>>> Note that even though the ROC curve as a whole is an interesting
>>>> 'statistic' (its area is a linear translation of the
>>>> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
>>>> statistics), each individual point on it is an improper scoring rule,
>>>> i.e., a rule that is optimized by fitting an inappropriate model.  Using
>>>> curves to select cutoffs is a low-precision and arbitrary operation, and
>>>> the cutoffs do not replicate from study to study.  Probably the worst
>>>> problem with drawing an ROC curve is that it tempts analysts to try to
>>>> find cutoffs where none really exist, and it makes analysts ignore the
>>>> whole field of decision theory.
>>>>
>>>> Frank Harrell
>>> Frank,
>>>
>>> This thread has caught my attention for a couple of reasons, possibly
>>> related to my novice-level experience.
>>>
>>> 1. In a logistic regression study, where I am predicting the probability
>>> of the response being 1 (for example), there exists a continuum of
>>> probability values and a finite number of {1,0} realities, whether I look
>>> within the original data set or within a new 'verification' data set.
>>> I understand that drawing a line through the probabilities returned from
>>> the logistic regression is a loss of information, but there are times
>>> when a 'hard' {1,0} prediction is required. I have found that the ROCR
>>> package (not necessarily the ROC curve) can be useful in identifying the
>>> probability cutoff where accuracy is maximized. Is this an unreasonable
>>> way of using logistic regression as a predictor?
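
An aside for concreteness: the ROCR computation described above is roughly 
the following, where phat and y are hypothetical placeholders for the 
predicted probabilities and the observed 0/1 outcomes:

  library(ROCR)
  pred <- prediction(phat, y)
  acc  <- performance(pred, "acc")    # accuracy at every candidate cutoff
  i    <- which.max(acc@y.values[[1]])
  acc@x.values[[1]][i]                # the accuracy-maximizing cutoff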
> 
> Thanks for the detailed response Frank. My follow-up questions are below:
> 
>> Logistic regression (with suitable attention to not assuming linearity
>> and to avoiding overfitting) is a great way to estimate P[Y=1].  Given
>> good predicted P[Y=1] and utilities (losses, costs) for incorrect
>> positive and negative decisions, an optimal decision is one that
>> optimizes expected utility.  The ROC curve does not play a direct role
>> in this regard.  
> 
> Ok.
> 
>> If per-subject utilities are not available, the analyst 
>> may make various assumptions about utilities (including the unreasonable
>> but often used assumption that utilities do not vary over subjects) to
>> find a cutoff on P[Y=1]. 
> 
> Can you elaborate on what exactly a "per-subject utility" is? In my case, I am 
> trying to predict the occurrence of specific soil features based on two 
> predictor variables: one continuous, the other categorical.  Thus far my 
> evaluation of how well this method works is based on how often I can 
> correctly predict (a categorical) quality.

This could be called a per-unit utility in your case.  It is the 
consequence of decisions at the point at which you decide Y=0 or Y=1. 
If consequences are the same over all units, you just have to deal with 
the single ratio of the cost of a false positive to the cost of a false 
negative.
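
For example, with cost C.fp for a false positive and C.fn for a false 
negative, predicting Y=1 has expected cost C.fp*(1 - p) and predicting 
Y=0 has expected cost C.fn*p, so expected cost is minimized by 
predicting Y=1 whenever p > C.fp/(C.fp + C.fn).  A minimal sketch (the 
4:1 cost ratio is made up):

  ## expected-cost-minimizing decision for one unit
  decide <- function(p, cost.fp = 1, cost.fn = 4)
    ifelse(p > cost.fp/(cost.fp + cost.fn), 1, 0)
  decide(0.3)   # threshold is 1/5 here, so predict Y=1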

One way to limit bad consequences is to not make any decision when the 
predicted probability is in the middle, i.e., the decision is 'obtain 
more data'.  That is a real advantage of having a continuous risk estimate.
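
A sketch of such a three-way rule (the 0.3-0.7 gray zone is arbitrary):

  ## decide only when the risk estimate is decisive
  classify <- function(p, lower = 0.3, upper = 0.7)
    ifelse(p < lower, "predict 0",
           ifelse(p > upper, "predict 1", "obtain more data"))
  classify(c(0.05, 0.50, 0.95))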

> 
> 
>> A very nice feature of P[Y=1] is that error 
>> probabilities are self-contained.  For example if P[Y=1] = .02 for a
>> single subject and you predict Y=0, the probability of an error is .02
>> by definition.  One doesn't need to compute an overall error probability
>> over the whole distribution of subjects' risks.  If the cost of a false
>> negative is C, the expected cost is .02*C in this example.
> 
> Interesting. The hang-up that I am having is that I need to predict from 
> {0,1}, as the direct users of this information are not currently interested 
> in raw probabilities. As far as I know, in order to predict a class from a 
> probability I need to use a cutoff... How else can I accomplish this without 
> imposing a cutoff on the entire dataset? One thought: identify a cutoff for 
> each level of the categorical predictor term in the model... (?)

You're right that you ultimately have to use a cutoff (or better still, 
educate the users about the meaning of probabilities and let them make 
the decision without exposing the cutoff).  And see the comment 
regarding gray zones above.

> 
>>> 2. The ROC curve can be a helpful way of communicating false positives /
>>> false negatives to other users who are less familiar with the output and
>>> interpretation of logistic regression.
>> What is more useful than that is a rigorous calibration curve estimate
>> to demonstrate the faithfulness of predicted P[Y=1] and a histogram
>> showing the distribution of predicted P[Y=1]
> 
> Ok. I can make that histogram - how would one go about making the 'rigorous 
> calibration curve'? Note that I have a training set, from which the model is 
> built, and a smaller testing set for evaluation.

See the val.prob function in the Design package.  This assumes your test 
samples and training samples are both large and independent. 
Otherwise data splitting is too noisy a method, and you might consider 
calibrate.lrm in Design, fitting all the data.
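
Roughly, with placeholder variable and data names (train, test, x1, x2, 
y are hypothetical):

  library(Design)
  ## external validation on a large independent test set
  f    <- lrm(y ~ rcs(x1, 4) + x2, data = train, x = TRUE, y = TRUE)
  phat <- predict(f, newdata = test, type = "fitted")
  val.prob(phat, test$y)        # calibration curve plus summary indexes

  ## or fit all the data and get a bootstrap overfitting-corrected curve
  plot(calibrate(f, B = 200))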

> 
> 
>> .  Models that put a lot of 
>> predictions near 0 or 1 are the most discriminating.  Calibration curves
>> and risk distributions are easier to explain than ROC curves.
> 
> By 'risk distributions' do you mean said histogram?

yes

> 
>> Too often 
>> a statistician will solve for a cutoff on P[Y=1], imposing her own
>> utility function without querying any subjects.
> 
> In this case I have picked the cutoff that resulted in the smallest number of 
> incorrectly classified observations, or the highest kappa / tau statistics -- 
> the results were very close.

Proportion of incorrect classifications is an improper scoring rule that 
tells you about the average performance of the method over all of the 
units.  It is not that helpful for an individual unit, as all units may 
have different predicted probabilities.  Because it's improper, you will 
find examples where a powerful variable is added to a model and the 
percent classified correctly decreases.
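
For instance (simulated data; an illustration, not a proof -- the 
percent-correct comparison can go either way, while the proper Brier 
score essentially always rewards the stronger model here):

  set.seed(1)
  n  <- 200
  x1 <- rnorm(n); x2 <- rnorm(n)
  y  <- rbinom(n, 1, plogis(x1 + 2*x2))
  f1 <- glm(y ~ x1,      family = binomial)
  f2 <- glm(y ~ x1 + x2, family = binomial)   # adds a powerful predictor
  p1 <- fitted(f1); p2 <- fitted(f2)
  mean((p1 > .5) == y); mean((p2 > .5) == y)  # percent classified correctly
  mean((p1 - y)^2);     mean((p2 - y)^2)      # Brier score (lower is better)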

> 
> 
>>> 3. I have been using the area under the ROC curve, Kendall's tau, and
>>> Cohen's kappa to evaluate the accuracy of a logistic regression based
>>> prediction, the last two statistics based on some probability cutoff
>>> identified beforehand.
>> ROC area (equiv. to Wilcoxon-Mann-Whitney and Somers' Dxy rank
>> correlation between pred. P[Y=1] and Y) is a measure of pure
>> discrimination, not a measure of accuracy per se.  Rank correlation
>> (concordance) measures do not require the use of cutoffs.
> 
> Ok. Hopefully I am not abusing the kappa and tau statistics too badly by using 
> them to evaluate a probability cutoff... (?)

Kappa, tau, Dxy, gamma, ROC area are all functions of the continuous 
predicted risks and the observed Y=0,1.  They don't deal with cutoffs.
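
The Hmisc package computes the rank-based ones directly; e.g., with 
phat the predicted risks and y the observed 0/1 outcomes:

  library(Hmisc)
  somers2(phat, y)   # returns C (the ROC area) and Dxy = 2*(C - .5)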

> 
>>> How does the topic of decision theory relate to some of the circumstances
>>> described above? Is there a better way to do some of these things?
>> See above re: expected losses/utilities.

Decision theory helps you translate maximum current information (often 
summarized in a predicted risk) and utilities/losses/costs into 
decisions.  I'm looking for a good background article on this; some 
useful material is in the Encyclopedia of Statistical Sciences, but 
other people may find good references for us.

Frank

>>
>> Good questions.
>>
>> Frank
> 
> Thanks for the feedback.
> 
> Cheers,
> 
> Dylan
> 
> 
>>> Cheers,
>>>
>>> Dylan
>>>
>>>> gyadav at ccilindia.co.in wrote:
>>>>> http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
>>>>>
>>>>> There is a lot of help; trying help.search("ROC curve") gave
>>>>> Help files with alias or concept or title matching 'ROC curve' using
>>>>> fuzzy matching:
>>>>>
>>>>> granulo(ade4)              Granulometric Curves
>>>>> plot.roc(analogue)         Plot ROC curves and associated diagnostics
>>>>> roc(analogue)              ROC curve analysis
>>>>> colAUC(caTools)            Column-wise Area Under ROC Curve (AUC)
>>>>> DProc(DPpackage)           Semiparametric Bayesian ROC curve analysis
>>>>> cv.enet(elasticnet)        Computes K-fold cross-validated error curve
>>>>>                            for elastic net
>>>>> ROC(Epi)                   Function to compute and draw ROC-curves.
>>>>> lroc(epicalc)              ROC curve
>>>>> cv.lars(lars)              Computes K-fold cross-validated error curve
>>>>>                            for lars
>>>>> roc.demo(TeachingDemos)    Demonstrate ROC curves by interactively
>>>>>                            building one
>>>>>
>>>>> HTH
>>>>> See the help and examples; those will suffice.
>>>>>
>>>>> Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Gaurav Yadav
>>>>> +++++++++++
>>>>> Assistant Manager, CCIL, Mumbai (India)
>>>>> Mob: +919821286118 Email: emailtogauravyadav at gmail.com
>>>>> Bhagavad Gita:  Man is made by his Belief, as He believes, so He is
>>>>>
>>>>> "Rithesh M. Mohan" <rithesh.m at brickworkindia.com>
>>>>> Sent by: r-help-bounces at stat.math.ethz.ch
>>>>> 07/26/2007 11:26 AM
>>>>>
>>>>> To
>>>>> <R-help at stat.math.ethz.ch>
>>>>> cc
>>>>>
>>>>> Subject
>>>>> [R] ROC curve in R
>>>>>
>>>>> Hi,
>>>>>
>>>>> I need to build an ROC curve in R; can you please provide data steps /
>>>>> code or guide me through it?
>>>>>
>>>>> Thanks and Regards
>>>>>
>>>>> Rithesh M Mohan
>>>>>
>>>> --
>>>> Frank E Harrell Jr   Professor and Chair           School of Medicine
>>>>                      Department of Biostatistics   Vanderbilt University


