[R] Proper / Improper scoring Rules
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Fri Aug 7 18:45:27 CEST 2009
Donald Catanzaro, PhD wrote:
> Hi All,
>
> I am working on some ordinal logistic regressions using lrm in the
> Design package. My response variable has three categories (1,2,3).
> After fitting my model and calling predict to get some values, I
> wanted to use a simple .5 cut-off to classify my probabilities into
> the categories.
>
> I had two questions:
>
> a) first, I am having trouble directly accessing the probabilities
> which may have more to do with my lack of experience with R
>
> For instance, my calls
>
> >ologit.three.NoPerFor <- lrm(Threshold.Three ~ TECI , data=CLD,
> na.action=na.pass)
> >CLD$Threshold.Predict.Three.NoPerFor<- predict(ologit.three.NoPerFor,
> newdata=CLD, type="fitted.ind")
> >CLD$Threshold.Predict.Three.NoPerFor.Cats[CLD$Threshold.Predict.Three.NoPerFor.Threshold.Three=1
> > .5] <- 1
> Error: unexpected '=' in
> "CLD$Threshold.Predict.Three.NoPerFor.Cats[CLD$Threshold.Predict.Three.NoPerFor.Threshold.Three="
>
>
> produce an error message, and it seems as if R does not like the equal
> sign at all. So how does one access the probabilities so I can classify
> them into the categories 1,2,3 and look at the performance of my model?
Use == to check equality; a single = is not a comparison operator in R.
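
A minimal sketch of the indexing, assuming the fitted probabilities come
back as a matrix with one column per response category. The matrix p and
its column names below are made up for illustration and stand in for the
real predict() output; check colnames() on your own object, since the
actual names may differ.

```r
# Hypothetical stand-in for a matrix of per-category probabilities.
# The column names are an assumption, not taken from the real fit.
p <- matrix(c(0.7, 0.2, 0.1,
              0.2, 0.6, 0.2,
              0.1, 0.3, 0.6,
              0.4, 0.4, 0.2),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("y=1", "y=2", "y=3")))

# A column name containing "=" has to be quoted inside [ ].
p1 <- p[, "y=1"]

# Classify with the .5 cut-off from the question, using comparison
# operators (>, ==) rather than a bare =.
cats <- rep(NA_integer_, nrow(p))
cats[p[, "y=1"] > 0.5] <- 1L
cats[p[, "y=2"] > 0.5] <- 2L
cats[p[, "y=3"] > 0.5] <- 3L

# == tests equality: which rows were put in category 1?
rows_cat1 <- which(cats == 1L)
```

Note that with three categories a row may have no probability above .5
(the last row here), in which case cats stays NA for that row.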
>
> b) which leads me to my next question. I thought that simply
> calculating the percent correct off of my predictions would be
> sufficient to look at performance but since my question is very much in
> line with this thread
> http://tolstoy.newcastle.edu.au/R/e4/help/08/04/8987.html I am not so
> sure anymore. I am afraid I did not understand Frank Harrell's last
> suggestion regarding improper scoring rule - can someone point me to
> some internet resources that I might be able to review to see why my
> approach would not be valid ?
Percent correct will give you misleading answers and is game-able: it
is an improper scoring rule, so a model with the wrong probabilities
can score best on it. It is also ultra-high-variance. Though not a
truly proper scoring rule, Somers' Dxy rank correlation (a
generalization of ROC area; Dxy = 2 * (C - 0.5)) is helpful. Better
still: use the log-likelihood and related quantities (deviance,
adequacy index as described in my book).
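
To illustrate why percent correct can mislead, here is a small sketch
in base R with made-up data. The two probability matrices (sharp, vague)
and the helper names pct_correct and log_lik are hypothetical and are
not part of the Design package; they only show that two sets of
predictions can classify identically while the log-likelihood still
separates them.

```r
# Observed categories (made-up data, for illustration only)
y <- c(1, 2, 3, 1, 2, 3)

# Two hypothetical sets of predicted probabilities; each row sums to 1.
# Both pick the right category for the first five rows and the wrong
# one for the last row.
sharp <- rbind(c(.8, .1, .1), c(.1, .8, .1), c(.1, .1, .8),
               c(.8, .1, .1), c(.1, .8, .1), c(.6, .3, .1))
vague <- rbind(c(.4, .3, .3), c(.3, .4, .3), c(.3, .3, .4),
               c(.4, .3, .3), c(.3, .4, .3), c(.4, .3, .3))

# Percent correct: classify each row by its most probable category.
pct_correct <- function(p, y) {
  mean(max.col(p, ties.method = "first") == y)
}

# Log-likelihood: sum of log predicted probability of the observed
# category (a proper scoring rule; deviance is -2 times this).
log_lik <- function(p, y) {
  sum(log(p[cbind(seq_along(y), y)]))
}

pct_correct(sharp, y)   # 5 of 6 rows correct
pct_correct(vague, y)   # also 5 of 6 rows correct
log_lik(sharp, y)       # higher (closer to 0)
log_lik(vague, y)       # lower
```

Percent correct cannot tell the two apart, while the log-likelihood
rewards the predictions that put more probability on what was observed.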
Frank
>
>
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University