[R] ROC curve in R

Sat Aug 4 01:09:20 CEST 2007

On Thursday 26 July 2007 10:45, Frank E Harrell Jr wrote:
> Dylan Beaudette wrote:
> > On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
> >> Note that even though the ROC curve as a whole is an interesting
> >> 'statistic' (its area is a linear translation of the
> >> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
> >> statistics), each individual point on it is an improper scoring rule,
> >> i.e., a rule that is optimized by fitting an inappropriate model.  Using
> >> curves to select cutoffs is a low-precision and arbitrary operation, and
> >> the cutoffs do not replicate from study to study.  Probably the worst
> >> problem with drawing an ROC curve is that it tempts analysts to try to
> >> find cutoffs where none really exist, and it makes analysts ignore the
> >> whole field of decision theory.
> >>
> >> Frank Harrell
> >
> > Frank,
> >
> > This thread has caught my attention for a couple of reasons, possibly
> > related to my novice-level experience.
> >
> > 1. In a logistic regression study, where I am predicting the probability
> > of the response being 1 (for example), there exists a continuum of
> > probability values and a finite number of {1,0} realities when I either
> > look within the original data set, or with a new 'verification' data set.
> > I understand that drawing a line through the probabilities returned from
> > the logistic regression is a loss of information, but there are times
> > when a 'hard' decision requiring prediction of {1,0} is required. I have
> > found that the ROCR package (not necessarily the ROC Curve) can be useful
> > in identifying the probability cutoff where accuracy is maximized. Is
> > this an unreasonable way of using logistic regression as a predictor?

Thanks for the detailed response, Frank. My follow-up questions are below:

> Logistic regression (with suitable attention to not assuming linearity
> and to avoiding overfitting) is a great way to estimate P[Y=1].  Given
> good predicted P[Y=1] and utilities (losses, costs) for incorrect
> positive and negative decisions, an optimal decision is one that
> optimizes expected utility.  The ROC curve does not play a direct role
> in this regard.  

Ok.

> If per-subject utilities are not available, the analyst 
> may make various assumptions about utilities (including the unreasonable
> but often used assumption that utilities do not vary over subjects) to
> find a cutoff on P[Y=1]. 

Can you elaborate on what exactly a "per-subject utility" is? In my case, I am 
trying to predict the occurrence of specific soil features based on two 
predictor variables: 1 continuous, the other categorical.  Thus far my 
evaluation of how well this method works is based on how often I can 
correctly predict (a categorical) quality.

> A very nice feature of P[Y=1] is that error 
> probabilities are self-contained.  For example if P[Y=1] = .02 for a
> single subject and you predict Y=0, the probability of an error is .02
> by definition.  One doesn't need to compute an overall error probability
> over the whole distribution of subjects' risks.  If the cost of a false
> negative is C, the expected cost is .02*C in this example.

Interesting. The hang-up that I am having is that I need to predict from 
{0,1}, as the direct users of this information are not currently interested 
in raw probabilities. As far as I know, in order to predict a class from a 
probability I need to use a cutoff... How else can I accomplish this without 
imposing a cutoff on the entire dataset? One thought, identify a cutoff for 
each level of the categorical predictor term in the model... (?)
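
A minimal sketch of the expected-cost reasoning quoted above, in base R. The
probabilities and the costs cost_fn and cost_fp are hypothetical values chosen
only for illustration; they do not come from the thread. When the costs are the
same for every subject, choosing the decision with the smaller expected cost is
equivalent to a single cutoff at cost_fp / (cost_fp + cost_fn).

```r
# Hypothetical predicted probabilities P[Y=1] for five subjects
p <- c(0.02, 0.10, 0.35, 0.60, 0.95)

# Hypothetical costs (assumptions, not taken from the thread)
cost_fn <- 5   # cost of predicting 0 when Y is really 1
cost_fp <- 1   # cost of predicting 1 when Y is really 0

# Expected cost of each possible decision, per subject
exp_cost_pred0 <- p * cost_fn          # P(error) is p when predicting 0
exp_cost_pred1 <- (1 - p) * cost_fp    # P(error) is 1 - p when predicting 1

# Pick the decision with the smaller expected cost
decision <- as.integer(exp_cost_pred1 < exp_cost_pred0)

# With constant costs this matches a single cutoff on P[Y=1]
cutoff <- cost_fp / (cost_fp + cost_fn)

print(data.frame(p, exp_cost_pred0, exp_cost_pred1, decision))
print(cutoff)
```

If the costs differ between subjects, cost_fn and cost_fp can be vectors of the
same length as p, in which case no single cutoff applies to the whole dataset.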

> > 2. The ROC curve can be a helpful way of communicating false positives /
> > false negatives to other users who are less familiar with the output and
> > interpretation of logistic regression.
>
> What is more useful than that is a rigorous calibration curve estimate
> to demonstrate the faithfulness of predicted P[Y=1] and a histogram
> showing the distribution of predicted P[Y=1]

Ok. I can make that histogram - how would one go about making the 'rigorous 
calibration curve'? Note that I have a training set, from which the model is 
built, and a smaller testing set for evaluation. 
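
One possible way to sketch both plots with base R only, assuming a
training/testing split like the one described. The data are simulated and every
variable name (d, x, g, y, train, test) is hypothetical. The calibration curve
here is a plain lowess smooth of the observed 0/1 outcomes against the
predicted probabilities; this is an illustration, not necessarily the
estimator Frank has in mind.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): one continuous and one
# categorical predictor, loosely mirroring the soil example
n <- 600
d <- data.frame(x = rnorm(n),
                g = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
lin <- -0.5 + 1.2 * d$x + c(a = 0, b = 0.8, c = -0.8)[as.character(d$g)]
d$y <- rbinom(n, 1, plogis(lin))

# Training / testing split
train <- d[1:400, ]
test  <- d[401:600, ]

# Fit on the training set, predict P[Y=1] on the testing set
fit    <- glm(y ~ x + g, family = binomial, data = train)
p_test <- predict(fit, newdata = test, type = "response")

# Smooth calibration curve: observed outcome vs. predicted probability
# (iter = 0 turns off robustness iterations, which suit binary y poorly)
cal <- lowess(p_test, test$y, iter = 0)

op <- par(mfrow = c(1, 2))
plot(cal, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted P[Y=1]", ylab = "Observed proportion")
abline(0, 1, lty = 2)   # line of perfect calibration
hist(p_test, breaks = 20, xlab = "Predicted P[Y=1]", main = "")
par(op)
```

A curve that stays close to the dashed diagonal suggests the predicted
probabilities are faithful; the histogram shows how the predictions are spread
between 0 and 1.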

> Models that put a lot of 
> predictions near 0 or 1 are the most discriminating.  Calibration curves
> and risk distributions are easier to explain than ROC curves.

By 'risk distributions' do you mean said histogram?

> Too often 
> a statistician will solve for a cutoff on P[Y=1], imposing her own
> utility function without querying any subjects.

In this case I have picked a cutoff that resulted in the smallest number of 
incorrectly classified observations, or the highest kappa / tau statistics -- 
the results were very close.

> > 3. I have been using the area under the ROC curve, Kendall's tau, and
> > Cohen's kappa to evaluate the accuracy of a logistic regression based
> > prediction, the last two statistics based on some probability cutoff
> > identified beforehand.
>
> ROC area (equiv. to Wilcoxon-Mann-Whitney and Somers' Dxy rank
> correlation between pred. P[Y=1] and Y) is a measure of pure
> discrimination, not a measure of accuracy per se.  Rank correlation
> (concordance) measures do not require the use of cutoffs.

Ok. Hopefully I am not abusing the kappa and tau statistics too badly by using 
them to evaluate a probability cutoff... (?)
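
To illustrate the point that rank-based discrimination measures need no
cutoff, here is a small base R sketch. The vectors p and y are made-up values,
not data from the thread. The ROC area is computed from the
Wilcoxon-Mann-Whitney rank sum, checked against direct pair counting, and then
translated to Somers' Dxy via Dxy = 2 * (AUC - 0.5).

```r
# Hypothetical predicted probabilities and observed outcomes
p <- c(0.10, 0.20, 0.30, 0.45, 0.60, 0.70, 0.80, 0.90)
y <- c(0,    0,    1,    0,    1,    0,    1,    1)

n1 <- sum(y == 1)
n0 <- sum(y == 0)

# ROC area from the Wilcoxon-Mann-Whitney rank-sum formula
r   <- rank(p)                      # midranks are used if there are ties
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Same quantity by direct pair counting: the proportion of (Y=1, Y=0)
# pairs in which the Y=1 subject has the higher prediction (ties count 1/2)
diffs     <- outer(p[y == 1], p[y == 0], "-")
auc_pairs <- mean((diffs > 0) + 0.5 * (diffs == 0))

# Somers' Dxy is a linear translation of the ROC area
dxy <- 2 * (auc - 0.5)

print(c(auc = auc, auc_pairs = auc_pairs, dxy = dxy))
```

No probability cutoff appears anywhere in the calculation, in contrast to
kappa, which can only be computed after the predictions have been dichotomized.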

> > How does the topic of decision theory relate to some of the circumstances
> > described above? Is there a better way to do some of these things?
>
> See above re: expected losses/utilities.
>
> Good questions.
>
> Frank

Thanks for the feedback.

Cheers,

Dylan

> > Cheers,
> >
> > Dylan
> >
> >> gyadav at ccilindia.co.in wrote:
> >>> http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
> >>>
> >>> There is a lot of help available; try help.search("ROC curve"), which gave:
> >>> Help files with alias or concept or title matching 'ROC curve' using
> >>> fuzzy matching:
> >>>
> >>>
> >>>
> >>> granulo(ade4)             Granulometric Curves
> >>> plot.roc(analogue)        Plot ROC curves and associated diagnostics
> >>> roc(analogue)             ROC curve analysis
> >>> colAUC(caTools)           Column-wise Area Under ROC Curve (AUC)
> >>> DProc(DPpackage)          Semiparametric Bayesian ROC curve analysis
> >>> cv.enet(elasticnet)       Computes K-fold cross-validated error curve
> >>>                           for elastic net
> >>> ROC(Epi)                  Function to compute and draw ROC-curves.
> >>> lroc(epicalc)             ROC curve
> >>> cv.lars(lars)             Computes K-fold cross-validated error curve
> >>>                           for lars
> >>> roc.demo(TeachingDemos)   Demonstrate ROC curves by interactively
> >>>                           building one
> >>>
> >>> HTH
> >>> See the help and examples; those will suffice.
> >>>
> >>> Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.
> >>>
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Gaurav Yadav
> >>> +++++++++++
> >>> Assistant Manager, CCIL, Mumbai (India)
> >>> Mob: +919821286118 Email: emailtogauravyadav at gmail.com
> >>> Bhagavad Gita:  Man is made by his Belief, as He believes, so He is
> >>>
> >>>
> >>>
> >>> "Rithesh M. Mohan" <rithesh.m at brickworkindia.com>
> >>> Sent by: r-help-bounces at stat.math.ethz.ch
> >>> 07/26/2007 11:26 AM
> >>>
> >>> To
> >>> <R-help at stat.math.ethz.ch>
> >>> cc
> >>>
> >>> Subject
> >>> [R] ROC curve in R
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> Hi,
> >>>
> >>>
> >>>
> >>> I need to build ROC curve in R, can you please provide data steps /
> >>> code or guide me through it.
> >>>
> >>>
> >>>
> >>> Thanks and Regards
> >>>
> >>> Rithesh M Mohan
> >>>
> >>>
> >>
> >> -
> >> Frank E Harrell Jr   Professor and Chair           School of Medicine
> >>                       Department of Biostatistics   Vanderbilt University