[R] Proper / Improper scoring Rules

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Aug 13 13:52:42 CEST 2009

```Donald Catanzaro, PhD wrote:
> Hi All,
>
> I have done more background research (including Frank's book) so I feel
> that my second question is answered.  However, as a novice R user I
> still have the following problem, accessing the output of predict.  So
> simplifying my question, using the example provided in the Design
> package
> (http://lib.stat.cmu.edu/S/Harrell/help/Design/html/predict.lrm.html) I
> might do something like:
>
>> # See help for predict.Design for several binary logistic
>> # regression examples
>>
>> # Examples of predictions from ordinal models
>> set.seed(1)
>> y <- factor(sample(1:3, 400, TRUE), 1:3, c('good','better','best'))
>> x1 <- runif(400)
>> x2 <- runif(400)
>> f <- lrm(y ~ rcs(x1,4)*x2)
>> predict(f, type="fitted.ind")[1:10,]   #gets Prob(better) and all others
>      y=good  y=better    y=best
> 1  0.3124704 0.3631544 0.3243752
> 2  0.3676075 0.3594685 0.2729240
> 3  0.2198274 0.3437416 0.4364309
> 4  0.3063463 0.3629658 0.3306879
> 5  0.5171323 0.3136088 0.1692590
> 6  0.3050115 0.3629071 0.3320813
> 7  0.3532452 0.3612928 0.2854620
> 8  0.2933928 0.3621220 0.3444852
> 9  0.3068595 0.3629867 0.3301538
> 10 0.6214710 0.2612164 0.1173126
>> d <- data.frame(x1=.5,x2=.5)
>> predict(f, d, type="fitted")        # Prob(Y>=j) for new observation
> y>=better   y>=best 0.6906593 0.3275849
>> predict(f, d, type="fitted.ind")    # Prob(Y=j)
>   y=good  y=better    y=best 0.3093407 0.3630744 0.3275849
>
> So now if I wanted to do
>
>> out <- predict(f, d, type="fitted.ind")>
>
>> out
>
>   y=good  y=better    y=best
> 0.3093407 0.3630744 0.3275849
>> out\$"y=better"
>
> Error in out\$"y=better" : \$ operator is invalid for atomic vectors

Recognize what type of object out is, and address its elements
accordingly.  E.g. out[,'y=best'].

Frank

>
>>
>
>
> y=better is the max, so how do I create something that says that ?
> (which is not exactly what I want to do but close enough to help me
> figure out what R code I need to accomplish the task)
>
> I can push the predictions out to a vector:
>
> out.vector <- as.vector(predict(f, d, type="fitted.ind"))
>
>> out.vector
>
> [1] 0.3093407 0.3630744 0.3275849
>
>
> which gets me part of the way because I can find out max(out.vector) but
> I still need to know what column the max is in. I think the problem is
> that I don't know how to manipulate data frames and vectors in R and
> need some guidance
>
> -Don
> Don Catanzaro, PhD                  Landscape Ecologist
> dgcatanzaro at gmail.com
> 16144 Sigmond Lane
> Lowell, AR 72745
> 479-751-3616
>
>
>
> Frank E Harrell Jr wrote:
>> Donald Catanzaro, PhD wrote:
>>> Hi All,
>>>
>>> I am working on some ordinal logistic regresssions using LRM in the
>>> Design package.  My response variable has three categories (1,2,3)
>>> and after using the creating my model and using a call to predict
>>> some values and I wanted to use a simple .5 cut-off to classify my
>>> probabilities into the categories.
>>>
>>> I had two questions:
>>>
>>> a)  first, I am having trouble directly accessing the probabilities
>>> which may have more to do with my lack of experience with R
>>>
>>> For instance, my calls
>>>
>>>  >ologit.three.NoPerFor <- lrm(Threshold.Three ~ TECI , data=CLD,
>>> na.action=na.pass)
>>>  >CLD\$Threshold.Predict.Three.NoPerFor<-
>>> predict(ologit.three.NoPerFor, newdata=CLD, type="fitted.ind")
>>>  >CLD\$Threshold.Predict.Three.NoPerFor.Cats[CLD\$Threshold.Predict.Three.NoPerFor.Threshold.Three=1
>>>  > .5] <- 1
>>> Error: unexpected '=' in
>>> "CLD\$Threshold.Predict.Three.NoPerFor.Cats[CLD\$Threshold.Predict.Three.NoPerFor.Threshold.Three="
>>>
>>>  >
>>>  >
>>>
>>> produce an error message and it seems as R does not like the equal
>>> sign at all.  So how does one access the probabilities so I can
>>> classify them into the categories of 1,2,3 so I can look at
>>> performance of my model ?
>>
>> use == to check equality
>>
>>>
>>> b)  which leads me to my next question.  I thought that simply
>>> calculating the percent correct off of my predictions would be
>>> sufficient to look at performance but since my question is very much
>>> in line with this thread
>>> http://tolstoy.newcastle.edu.au/R/e4/help/08/04/8987.html I am not so
>>> sure anymore.  I am afraid I did not understand Frank Harrell's last
>>> suggestion regarding improper scoring rule - can someone point me to
>>> some internet resources that I might be able to review to see why my
>>> approach would not be valid ?
>>
>> Percent correct will give you misleading answers and is game-able.  It
>> is also ultra-high-variance.  Though not a truly proper scoring rule,
>> Somers' Dxy rank correlation (generalization of ROC area) is helpful.
>> Better still: use the log-likelihood and related quantities (deviance,
>> adequacy index as described in my book).
>>
>> Frank
>>
>>>
>>>
>>
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
Department of Biostatistics   Vanderbilt University

```