[R] Proper / Improper scoring Rules

Fri Aug 7 18:45:27 CEST 2009

Donald Catanzaro, PhD wrote:
> Hi All,
> 
> I am working on some ordinal logistic regressions using lrm in the 
> Design package.  My response variable has three categories (1,2,3). 
> After fitting my model and calling predict to get some values, I 
> wanted to use a simple .5 cut-off to classify my probabilities into 
> the categories.
> 
> I had two questions:
> 
> a)  first, I am having trouble directly accessing the probabilities, 
> which may have more to do with my lack of experience with R.
> 
> For instance, my calls
> 
>  > ologit.three.NoPerFor <- lrm(Threshold.Three ~ TECI, data=CLD,
>        na.action=na.pass)
>  > CLD$Threshold.Predict.Three.NoPerFor <- predict(ologit.three.NoPerFor,
>        newdata=CLD, type="fitted.ind")
>  > CLD$Threshold.Predict.Three.NoPerFor.Cats[CLD$Threshold.Predict.Three.NoPerFor.Threshold.Three=1 > .5] <- 1
> Error: unexpected '=' in
> "CLD$Threshold.Predict.Three.NoPerFor.Cats[CLD$Threshold.Predict.Three.NoPerFor.Threshold.Three="
> 
> produce an error message, and it seems R does not like the equal sign 
> at all.  So how does one access the probabilities so I can classify them 
> into the categories 1,2,3 and look at the performance of my model?

Use == to test equality; a single = is assignment (or argument naming) in R.
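
As a minimal sketch of the point above (not the poster's actual data): the 
matrix p below is a hypothetical stand-in for what 
predict(..., type="fitted.ind") returns, and its column names are assumed 
to follow the "Threshold.Three=1" pattern seen in the error message.  It 
uses only base R and shows indexing a column by name, comparing with a 
relational operator, and picking the most probable category.

```r
# Hypothetical stand-in for the matrix of per-category probabilities:
# one row per observation, one column per response category.
p <- matrix(c(0.70, 0.20, 0.10,
              0.25, 0.55, 0.20,
              0.10, 0.30, 0.60,
              0.40, 0.35, 0.25),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, paste0("Threshold.Three=", 1:3)))

# A column name containing "=" is not a syntactic name, so index the
# matrix by quoted name (or by position) instead of building it into $.
p1 <- p[, "Threshold.Three=1"]

# Comparisons use relational operators (==, >, <), never a single =.
is.cat1 <- p1 > 0.5

# Most probable category for each row.
pred.cat <- max.col(p, ties.method = "first")
```

With three categories, a row such as the fourth has no probability above 
.5, so a fixed .5 cut-off would leave it unclassified, whereas max.col 
still assigns a category.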

> 
> b)  which leads me to my next question.  I thought that simply 
> calculating the percent correct from my predictions would be 
> sufficient to look at performance, but since my question is very much in 
> line with this thread 
> http://tolstoy.newcastle.edu.au/R/e4/help/08/04/8987.html I am not so 
> sure anymore.  I am afraid I did not understand Frank Harrell's last 
> suggestion regarding improper scoring rules.  Can someone point me to 
> some internet resources that I might be able to review to see why my 
> approach would not be valid?

Percent correct is an improper scoring rule: it can be optimized by a 
miscalibrated model, so it will give you misleading answers and is 
game-able.  It is also very high-variance.  Somers' Dxy rank correlation 
(a generalization of ROC area; Dxy = 2(C - 0.5)) is helpful, though it is 
not a truly proper scoring rule.  Better still: use the log-likelihood 
and related quantities (deviance, adequacy index as described in my book).
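
To illustrate why percent correct can mislead, here is a small base-R 
sketch with made-up numbers.  The objects y, pA and pB and the two helper 
functions are hypothetical and not part of Design.  Both sets of predicted 
probabilities classify the same observations correctly, yet the 
log-likelihood separates them.

```r
# Hypothetical observed categories (1, 2 or 3) for four observations.
y <- c(1, 2, 3, 1)

# Two hypothetical sets of predicted probabilities (rows sum to 1).
pA <- matrix(c(0.90, 0.05, 0.05,
               0.05, 0.90, 0.05,
               0.05, 0.05, 0.90,
               0.30, 0.40, 0.30), ncol = 3, byrow = TRUE)
pB <- matrix(c(0.40, 0.30, 0.30,
               0.30, 0.40, 0.30,
               0.30, 0.30, 0.40,
               0.30, 0.40, 0.30), ncol = 3, byrow = TRUE)

# Proportion classified correctly when each row is assigned to its most
# probable category.
pct.correct <- function(p, y) mean(max.col(p, ties.method = "first") == y)

# Log-likelihood: sum of the log probabilities given to the observed
# categories (deviance is -2 times this).
log.lik <- function(p, y) sum(log(p[cbind(seq_along(y), y)]))

accA <- pct.correct(pA, y); accB <- pct.correct(pB, y)
llA  <- log.lik(pA, y);     llB  <- log.lik(pB, y)
```

Here accA and accB are both 0.75, but llA is larger than llB (lower 
deviance), because pA puts more probability on the categories that were 
actually observed.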

Frank

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University