[R] predict.lrm (Design package) poor performance?
Chris Mcowen
chrismcowen at gmail.com
Tue Sep 21 12:24:16 CEST 2010
Thanks Frank,
I have one small question regarding this. I understand you are very busy, so if you can't answer I would greatly appreciate any thoughts from the list.
> Split-sample validation is not reliable unless you have, say, 10,000 samples to begin with
I am a little confused. When I ran the model with a binary response I had a Brier score of 0.199 and a C-value of 0.73, and when I looked at the data, 75% of the predictions were "correct".
However, when I ran the model with an ordinal response, the Brier score was 0.201 and the C-value 0.677, which to me, albeit with limited knowledge in the area, suggests the model performs worse, but not by a great deal. Yet when I compare the predicted mean values with the "real" data I only get 45% accuracy; it is this discrepancy I don't understand.
I appreciate you need a large dataset to partition it, but surely if the C-value and Brier score are similar then the number predicted "correct" should also be similar?
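[Editor's note: a minimal sketch of where such a discrepancy can come from. The object names fit.bin, fit.ord, y.bin and y.ord, the 0/1 coding, and the 0.5 cutoff are illustrative assumptions, and it assumes the ordinal "percent correct" is obtained by picking the level with the largest predicted probability (the thread does not say exactly how the 45% was computed). The binary figure thresholds a single probability, while the ordinal figure must pick the exact level out of six, so similar Brier scores and C-values can coexist with very different percent-correct figures.

# Binary model: one probability against one cutoff (assumed objects, not from the thread)
p.bin   <- predict(fit.bin, type="fitted")        # P(Y=1) from a binary lrm fit
acc.bin <- mean((p.bin > 0.5) == (y.bin == 1))    # roughly the "75% correct" figure

# Ordinal model: predicted class = level with the largest of six probabilities
p.ind   <- predict(fit.ord, type="fitted.ind")    # one column of P(Y=j) per level
pred    <- max.col(p.ind)                         # column index of the row maximum
acc.ord <- mean(pred == as.integer(y.ord))        # roughly the "45% correct" figure

c(binary = acc.bin, ordinal = acc.ord)
]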
Thanks
Chris
On 20 Sep 2010, at 14:46, Frank Harrell wrote:
A few comments; sorry I don't have time for any more.
- Combining categories is almost always a bad idea
- It can be harder to discriminate more categories, but that's only because the task is more difficult
- Split-sample validation is not reliable unless you have, say, 10,000 samples to begin with; otherwise 100 repeats of 10-fold cross-validation, or (faster) 200 or so bootstraps, is better (see the sketch after this list). Be sure to re-do all modeling decisions from scratch for each re-sample. If you did no statistical variable selection this may not be an issue.
- You don't need ".lrm" after predict
- Not sure why your variable names are all upper case; harder to read this way
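[Editor's note: a minimal sketch (not code from the thread) of the bootstrap validation described in the list above, using the Design package's validate() function, which today lives in rms; it can be run directly on the fit because model1 was created with x=TRUE, y=TRUE.

library(Design)                                  # validate() is in Design (now rms)
val <- validate(model1, method="boot", B=200)    # about 200 bootstrap repetitions
val    # bias-corrected indexes; the C-value is Dxy/2 + 0.5 and B is the Brier score

If variable selection had been used, the selection would have to be repeated inside each resample (see the bw argument of validate) for the result to be honest.]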
Good luck
Frank
Frank E Harrell Jr, Professor and Chairman
Department of Biostatistics, School of Medicine, Vanderbilt University
On Mon, 20 Sep 2010, Chris Mcowen wrote:
> Dear Professor Harrell,
> I am familiar with binary models; however, I am now trying to get predictions from an ordinal model and have a question.
> I have a data set made up of 12 categorical predictors; the response variable is classed as 1,2,3,4,5,6, which relates to the threat level of the species (on the IUCN rating).
> Previously I combined levels 1 and 2 to form "non-threatened" and levels 3-6 to form "threatened", and ran a binary model. I tested its performance based on the Brier score (0.198) and the AUC or C-value (0.75). I also partitioned the data set into training and test data and used the predict function to get predicted probabilities for the new data. When visualising the results with a cutoff value calculated with Epi, roughly 75% of the time the prediction was correct.
> Now I am interested in predicting the threat level of a species not purely as threatened or not, but to the specific IUCN levels. I have used the predict.lrm function (predict.lrm(model1, type="fitted.ind")) to generate probabilities for each level, and also predict(model1, traist, type="fitted"); see below.
> When I call the model the Brier score is 0.201 and the C-value 0.677. However, when I inspect the output and relate it to the corresponding species (for which I know the true IUCN rating) the model performs very badly, only getting 43% correct. Interestingly, I have noticed the probabilities are always highest for levels 1 and 6; on no occasion do levels 2, 3, 4 or 5 have high probabilities.
> I am unsure whether this is just because the model cannot discriminate between the various levels due to insufficient data, or whether I am doing something wrong; if so, I would greatly appreciate any advice or a suggestion of a different method.
> Thanks in advance,
> Chris
> model1 <- lrm(EXTINCTION ~ BS*FR + WO + LIF + REG + ALT + BIO + HAB + PD + SEA, x=TRUE, y=TRUE)
> predict.lrm(model1, type="fitted.ind")
>    EXTINCTION=1 EXTINCTION=2 EXTINCTION=3 EXTINCTION=4 EXTINCTION=5 EXTINCTION=6
> 1    0.19748393   0.05895033   0.12811186  0.086140778  0.068137104   0.46117600
> 2    0.27790178   0.07247496   0.14384976  0.087487677  0.064584865   0.35370096
> 3    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
> 4    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
> 5    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
> 6    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
> 7    0.13928899   0.04558050   0.10636220  0.077770389  0.065500459   0.56549746
> 8    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
> 9    0.24605628   0.06777939   0.13931242  0.087996215  0.066625303   0.39223038
> 10   0.33865077   0.07915126   0.14744522  0.083923247  0.059387585   0.29144192
> predict(model1, traist, type="fitted")
>          y>=2       y>=3       y>=4       y>=5       y>=6
> 1  0.80251607 0.74356575 0.61545388 0.52931311 0.46117600
> 2  0.72209822 0.64962327 0.50577351 0.41828583 0.35370096
> 3  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
> 4  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
> 5  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
> 6  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
> 7  0.86071101 0.81513051 0.70876831 0.63099792 0.56549746
> 8  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
> 9  0.75394372 0.68616432 0.54685190 0.45885569 0.39223038
> 10 0.66134923 0.58219797 0.43475276 0.35082951 0.29144192
>
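[Editor's note: a minimal sketch (using the objects above) of how the two tables relate under the proportional-odds model fitted by lrm: each interior class probability is the difference of adjacent cumulative probabilities, so levels 2-5 receive only the small gaps between the y>=j columns and are never the row maximum here, which is exactly the "always highest for levels 1 and 6" pattern Chris observed.

cum <- predict(model1, type="fitted")      # cumulative columns y>=2 ... y>=6
p1  <- 1 - cum[, "y>=2"]                   # P(Y=1)
p6  <- cum[, "y>=6"]                       # P(Y=6); identical to the EXTINCTION=6 column
mid <- cum[, -ncol(cum)] - cum[, -1]       # P(Y=2) ... P(Y=5)
ind <- cbind(p1, mid, p6)                  # reproduces predict(..., type="fitted.ind")

# e.g. row 1: P(Y=2) = 0.80251607 - 0.74356575 = 0.0589503..., matching the
# EXTINCTION=2 column up to print rounding.]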