[R] Questions about lrm, validate, pentrace

Tue Apr 26 16:53:10 CEST 2011

Thank you for you quick reply, Prof. Harrell.
According to your advice, I ran pentrace using a very wide range.

 > pentrace.x6factor <- pentrace(x6factor.lrm, seq(0, 100, by=0.5))
 > plot(pentrace.x6factor)

I attached this figure. Then,

 > pentrace.x6factor <- pentrace(x6factor.lrm, seq(0, 10, by=0.05))

It seems reasonable that the best penalty is 2.55.

 > x6factor.lrm.pen <- update(x6factor.lrm, penalty=2.55)
 > cbind(coef(x6factor.lrm), coef(x6factor.lrm.pen), 
abs(coef(x6factor.lrm)-coef(x6factor.lrm.pen)))
                      [,1]        [,2]        [,3]
Intercept     -4.32434556 -3.86816460 0.456180958
stenosis      -0.01496757 -0.01091755 0.004050025
T1             3.04248257  2.42443034 0.618052225
T2            -0.75335619 -0.57194342 0.181412767
procedure     -1.20847252 -0.82589263 0.382579892
ClinicalScore  0.37623189  0.30524628 0.070985611

 > validate(x6factor.lrm, bw=F, B=200)
           index.orig training    test optimism index.corrected   n
Dxy           0.6324   0.6849  0.5955   0.0894          0.5430 200
R2            0.3668   0.4220  0.3231   0.0989          0.2679 200
Intercept     0.0000   0.0000 -0.1924   0.1924         -0.1924 200
Slope         1.0000   1.0000  0.7796   0.2204          0.7796 200
Emax          0.0000   0.0000  0.0915   0.0915          0.0915 200
D             0.2716   0.3229  0.2339   0.0890          0.1826 200
U            -0.0192  -0.0192  0.0243  -0.0436          0.0243 200
Q             0.2908   0.3422  0.2096   0.1325          0.1582 200
B             0.1272   0.1171  0.1357  -0.0186          0.1457 200
g             1.6328   1.9879  1.4940   0.4939          1.1389 200
gp            0.2367   0.2502  0.2216   0.0286          0.2080 200

 > validate(x6factor.lrm.pen, bw=F, B=200)
           index.orig training    test optimism index.corrected   n
Dxy           0.6375   0.6857  0.6024   0.0833          0.5542 200
R2            0.3145   0.3488  0.3267   0.0221          0.2924 200
Intercept     0.0000   0.0000  0.0882  -0.0882          0.0882 200
Slope         1.0000   1.0000  1.0923  -0.0923          1.0923 200
Emax          0.0000   0.0000  0.0340   0.0340          0.0340 200
D             0.2612   0.2571  0.2370   0.0201          0.2411 200
U            -0.0192  -0.0192 -0.0047  -0.0145         -0.0047 200
Q             0.2805   0.2763  0.2417   0.0346          0.2458 200
B             0.1292   0.1224  0.1355  -0.0132          0.1423 200
g             1.2704   1.3917  1.5019  -0.1102          1.3805 200
gp            0.2020   0.2091  0.2229  -0.0138          0.2158 200

In the penalized model (x6factor.lrm.pen), the apparent Dxy is 0.64, and 
bias-corrected Dxy is 0.55. The maximum absolute error is estimated to 
be 0.034, smaller than non-penalized model (0.0915 in x6factor.lrm) The 
changes in slope and intercept are substantially reduced in penalized 
model. I think overfitting is improved at least to some extent. Should I 
select this as a final model?

I have one more question. The "procedure" variable was defined as 0/1 
value in the previous mail. For some graphical reason, I redefined it as 
treat1/treat2 value. Then, the best penalty value was changed from 3.05 
to 2.55. I guess change from numeric to factorial caused this reduction 
in penalty. Which set up should I select?

I appreciate your help in advance.

-- 
KH

(11/04/26 0:21), Frank Harrell wrote:
> You've done a lot of good work on this.  Yes I would say you have moderate
> overfitting with the first model.  The only thing that saved you from having
> severe overfitting is that there seems to be a signal present [I am assume
> this model is truly pre-specified and was not developed at all by looking at
> patterns of responses Y.]
>
> The use of backwards stepdown demonstrated much worse overfitting.  This is
> in line with what we know about the damage of stepwise selection methods
> that do not incorporate shrinkage.  I would throw away the stepwise
> regression model.  You'll find that the model selected is entirely
> arbitrary.  And you can't use the "selected" variables in any re-fit of the
> model, i.e., you can't use lrm pretending that the two remaining variables
> were pre-specified.  Stepwise regression methods only seem to help.  When
> assessed properly we see that is an illusion.
>
> You are using penalizing properly but you did not print the full table of
> penalties vs. effective AIC.  We don't have faith that you penalized enough.
> I tend to run pentrace using a very wide range of possible penalties to make
> sure I've found the global optimum.
>
> Penalization somewhat solves the EPV problem but there is no substitute for
> getting more data.
>
> You can run validate specifying your final penalty as an argument.
>
> Frank
>
>
>
> 細田弘吉 wrote:
>>
>> According to the advice, I tried rms package.
>> Just to make sure, I have data of 104 patients (x6.df), which consists
>> of 5 explanatory variables and one binary outcome (poor/good) (previous
>> model 2 strategy). The outcome consists of 25 poor results and 79 good
>> results. Therefore, My events per variable (EPV) is only 5 (much less
>> than the rule of thumb of 10).
>>
>> My questions are about validate and pentrace in rms package.
>> I present some codes and results.
>> I appreciate anybody's help in advance.
>>
>>   >  x6.lrm<- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore,
>> data=x6.df, x=T, y=T)
>>
>>   >  x6.lrm
>> ...
>> Obs  104    LR chi2      29.24    R2       0.367    C       0.816
>>    negative 79    d.f.         5    g        1.633    Dxy     0.632
>>    positive 25    Pr(>  chi2)<0.0001   gr    5.118    gamma   0.632
>> max |deriv| 1e-08                    gp    0.237    tau-a   0.233
>>                                        Brier   0.127
>>
>>                  Coef    S.E.   Wald Z Pr(>|Z|)
>> Intercept      -5.5328 2.6287 -2.10  0.0353
>> stenosis       -0.0150 0.0284 -0.53  0.5979
>> x1              3.0425 0.9100  3.34  0.0008
>> x2             -0.7534 0.4519 -1.67  0.0955
>> procedure       1.2085 0.5717  2.11  0.0345
>> ClinicalScore   0.3762 0.2287  1.65  0.0999
>>
>> It seems not too bad. Next, validation by bootstrap ...
>>
>>   >  validate(x6.lrm, B=200, bw=F)
>>             index.orig training    test optimism index.corrected   n
>> Dxy           0.6324   0.6960  0.5870   0.1091          0.5233 200
>> R2            0.3668   0.4370  0.3154   0.1216          0.2453 200
>> Intercept     0.0000   0.0000 -0.2007   0.2007         -0.2007 200
>> Slope         1.0000   1.0000  0.7565   0.2435          0.7565 200
>> Emax          0.0000   0.0000  0.0999   0.0999          0.0999 200
>> D             0.2716   0.3368  0.2275   0.1093          0.1623 200
>> U            -0.0192  -0.0192  0.0369  -0.0561          0.0369 200
>> Q             0.2908   0.3560  0.1906   0.1654          0.1254 200
>> B             0.1272   0.1155  0.1384  -0.0229          0.1501 200
>> g             1.6328   2.0740  1.4647   0.6093          1.0235 200
>> gp            0.2367   0.2529  0.2189   0.0341          0.2026 200
>>
>> The apparent Dxy is 0.63, and bias-corrected Dxy is 0.52. The maximum
>> absolute error is estimated to be 0.099. The changes in slope and
>> intercept are also more substantial. In all, there is evidence that I am
>> somewhat overfitting the data, right?.
>>
>> Furthermore, using step-down variable selection ...
>>
>>   >  validate(x6.lrm, B=200, bw=T)
>>
>> 		Backwards Step-down - Original Model
>>
>>    Deleted        Chi-Sq d.f. P      Residual d.f. P      AIC
>>    stenosis       0.28   1    0.5979 0.28     1    0.5979 -1.72
>>    ClinicalScore  2.60   1    0.1068 2.88     2    0.2370 -1.12
>>    x2             2.86   1    0.0910 5.74     3    0.1252 -0.26
>>
>> Approximate Estimates after Deleting Factors
>>
>>                Coef   S.E. Wald Z         P
>> Intercept  -5.865 1.4136 -4.149 3.336e-05
>> x1          2.915 0.8685  3.357 7.889e-04
>> procedure   1.072 0.5590  1.918 5.508e-02
>>
>> Factors in Final Model
>>
>> [1] x1         procedure
>>             index.orig training    test optimism index.corrected   n
>> Dxy           0.5661   0.6755  0.5559   0.1196          0.4464 200
>> R2            0.2876   0.4085  0.2784   0.1301          0.1575 200
>> Intercept     0.0000   0.0000 -0.2459   0.2459         -0.2459 200
>> Slope         1.0000   1.0000  0.7300   0.2700          0.7300 200
>> Emax          0.0000   0.0000  0.1173   0.1173          0.1173 200
>> D             0.2038   0.3130  0.1970   0.1160          0.0877 200
>> U            -0.0192  -0.0192  0.0382  -0.0574          0.0382 200
>> Q             0.2230   0.3323  0.1589   0.1734          0.0496 200
>> B             0.1441   0.1192  0.1452  -0.0261          0.1702 200
>> g             1.2628   1.9524  1.3222   0.6302          0.6326 199
>> gp            0.2041   0.2430  0.2043   0.0387          0.1654 199
>>
>> If I select only two variables (x1 and procedure), bias-corrected Dxy
>> goes down to 0.45.
>>
>> [Question 1]
>> I have EPV problem. Even so, should I keep the full model (5-variable
>> model)? or can I use the 2-variable (x1 and procedure) model which the
>> validate() with step-down provides?
>>
>> [Question 2]
>> If I use 2-variable model, should I do
>> x2.lrm<- lrm(postopDWI_HI ~ T1+procedure2, data=x6.df, x=T, y=T)?
>> or keep the value showed above by validate function?
>>
>> Next, shrinkage ...
>>
>>   >  pentrace(x6.lrm, seq(0, 5.0, by=0.05))
>> Best penalty:
>> penalty         df
>>      3.05   4.015378
>>
>> The best penalty is 3.05. So, I update it with this penalty to obtain
>> the corresponding penalized model:
>>
>>   >  x6.lrm.pen<- update(x6.lrm, penalty=3.05, x=T, y=T)
>>   >  x6.lrm.pen
>> .....
>> Penalty factors
>>
>>    simple nonlinear interaction nonlinear.interaction
>>      3.05      3.05        3.05                  3.05
>> Final penalty on -2 log L
>>        [,1]
>> [1,]  3.8
>>
>> Obs     104    LR chi2      28.18    R2       0.313    C       0.818
>>    negative    79    d.f.     4.015    g        1.264    Dxy     0.635
>>    positive    25   Pr(>  chi2)<0.0001 gr       3.538    gamma   0.637
>> max |deriv| 3e-05                    gp       0.201    tau-a   0.234
>>                                        Brier    0.129
>>
>>                  Coef    S.E.   Wald Z Pr(>|Z|) Penalty Scale
>> Intercept      -4.7246 2.2429 -2.11  0.0352    0.0000
>> stenosis       -0.0105 0.0240 -0.44  0.6621   17.8021
>> x1              2.3605 0.7254  3.25  0.0011    0.6054
>> x2             -0.5385 0.3653 -1.47  0.1404    1.2851
>> procedure       0.9247 0.4844  1.91  0.0563    0.8576
>> ClinicalScore   0.3046 0.1874  1.63  0.1041    2.4779
>>
>> Arrange the coefficients of the two models side by side, and also list
>> the difference between the two:
>>
>>   >  cbind(coef(x6.lrm), coef(x6.lrm.pen),
>> abs(coef(x6.lrm)-coef(x6.lrm.pen)))
>>                         [,1]        [,2]        [,3]
>> Intercept      -5.53281808 -4.72464766 0.808170417
>> stenosis       -0.01496757 -0.01050797 0.004459599
>> x1              3.04248257  2.36051833 0.681964238
>> x2             -0.75335619 -0.53854750 0.214808685
>> procedure       1.20847252  0.92474708 0.283725441
>> ClinicalScore   0.37623189  0.30457557 0.071656322
>>
>> [Question 3]
>> Is this penalized model the one I should present for my colleagues?
>> I still have EPV problem. Or is EPV problem O.K. if I use penalization?
>>
>> I am still wondering about what I can do to avoid EPV problem.
>> Collecting new data would be a long-time and huge work...
>>
>>
>> (11/04/22 1:46), khosoda at med.kobe-u.ac.jp wrote:
>>> Thank you for your comment.
>>> I forgot to mention that varclus and pvclust showed similar results for
>>> my data.
>>>
>>> BTW, I did not realize rms is a replacement for the Design package.
>>> I appreciate your suggestion.
>>> --
>>> KH
>>>
>>> (11/04/21 8:00), Frank Harrell wrote:
>>>> I think it's OK. You can also use the Hmisc package's varclus function.
>>>> Frank
>>>>
>>>>
>>>> 細田弘吉 wrote:
>>>>>
>>>>> Dear Prof. Harrel,
>>>>>
>>>>> Thank you very much for your quick advice.
>>>>> I will try rms package.
>>>>>
>>>>> Regarding model reduction, is my model 2 method (clustering and
>>>>> recoding
>>>>> that are blinded to the outcome) permissible?
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> --
>>>>> KH
>>>>>
>>>>> (11/04/20 22:01), Frank Harrell wrote:
>>>>>> Deleting variables is a bad idea unless you make that a formal part of
>>>>>> the
>>>>>> BMA so that the attempt to delete variables is penalized for.
>>>>>> Instead of
>>>>>> BMA I recommend simple penalized maximum likelihood estimation (see
>>>>>> the
>>>>>> lrm
>>>>>> function in the rms package) or pre-modeling data reduction that is
>>>>>> blinded
>>>>>> to the outcome variable.
>>>>>> Frank
>>>>>>
>>>>>>
>>>>>> 細田弘吉 wrote:
>>>>>>>
>>>>>>> Hi everybody,
>>>>>>> I apologize for long mail in advance.
>>>>>>>
>>>>>>> I have data of 104 patients, which consists of 15 explanatory
>>>>>>> variables
>>>>>>> and one binary outcome (poor/good). The outcome consists of 25 poor
>>>>>>> results and 79 good results. I tried to analyze the data with
>>>>>>> logistic
>>>>>>> regression. However, the 15 variables and 25 events means events per
>>>>>>> variable (EPV) is much less than 10 (rule of thumb). Therefore, I
>>>>>>> used R
>>>>>>> package, "BMA" to perform logistic regression with BMA to avoid this
>>>>>>> problem.
>>>>>>>
>>>>>>> model 1 (full model):
>>>>>>> x1, x2, x3, x4 are continuous variables and others are binary data.
>>>>>>>
>>>>>>>> x16.bic.glm<- bic.glm(outcome ~ ., data=x16.df,
>>>>>>> glm.family="binomial", OR20, strict=FALSE)
>>>>>>>> summary(x16.bic.glm)
>>>>>>> (The output below has been cut off at the right edge to save space)
>>>>>>>
>>>>>>> 62 models were selected
>>>>>>> Best 5 models (cumulative posterior probability = 0.3606 ):
>>>>>>>
>>>>>>> p!=0 EV SD model 1 model2
>>>>>>> Intercept 100 -5.1348545 1.652424 -4.4688 -5.15
>>>>>>> -5.1536
>>>>>>> age 3.3 0.0001634 0.007258 .
>>>>>>> sex 4.0
>>>>>>> .M -0.0243145 0.220314 .
>>>>>>> side 10.8
>>>>>>> .R 0.0811227 0.301233 .
>>>>>>> procedure 46.9 -0.5356894 0.685148 . -1.163
>>>>>>> symptom 3.8 -0.0099438 0.129690 . .
>>>>>>> stenosis 3.4 -0.0003343 0.005254 .
>>>>>>> x1 3.7 -0.0061451 0.144084 .
>>>>>>> x2 100.0 3.1707661 0.892034 3.2221 3.11
>>>>>>> x3 51.3 -0.4577885 0.551466 -0.9154 .
>>>>>>> HT 4.6
>>>>>>> .positive 0.0199299 0.161769 . .
>>>>>>> DM 3.3
>>>>>>> .positive -0.0019986 0.105910 . .
>>>>>>> IHD 3.5
>>>>>>> .positive 0.0077626 0.122593 . .
>>>>>>> smoking 9.1
>>>>>>> .positive 0.0611779 0.258402 . .
>>>>>>> hyperlipidemia 16.0
>>>>>>> .positive 0.1784293 0.512058 . .
>>>>>>> x4 8.2 0.0607398 0.267501 . .
>>>>>>>
>>>>>>>
>>>>>>> nVar 2 2
>>>>>>> 1 3 3
>>>>>>> BIC -376.9082
>>>>>>> -376.5588 -376.3094 -375.8468 -374.5582
>>>>>>> post prob 0.104
>>>>>>> 0.087 0.077 0.061 0.032
>>>>>>>
>>>>>>> [Question 1]
>>>>>>> Is it O.K to calculate odds ratio and its 95% confidence interval
>>>>>>> from
>>>>>>> "EV" (posterior distribution mean) and“SD”(posterior distribution
>>>>>>> standard deviation)?
>>>>>>> For example, 95%CI of EV of x2 can be calculated as;
>>>>>>>> exp(3.1707661)
>>>>>>> [1] 23.82573 ----->  odds ratio
>>>>>>>> exp(3.1707661+1.96*0.892034)
>>>>>>> [1] 136.8866
>>>>>>>> exp(3.1707661-1.96*0.892034)
>>>>>>> [1] 4.146976
>>>>>>> ------------------>  95%CI (4.1 to 136.9)
>>>>>>> Is this O.K.?
>>>>>>>
>>>>>>> [Question 2]
>>>>>>> Is it permissible to delete variables with small value of "p!=0" and
>>>>>>> "EV", such as age (3.3% and 0.0001634) to reduce the number of
>>>>>>> explanatory variables and reconstruct new model without those
>>>>>>> variables
>>>>>>> for new session of BMA?
>>>>>>>
>>>>>>> model 2 (reduced model):
>>>>>>> I used R package, "pvclust", to reduce the model. The result
>>>>>>> suggested
>>>>>>> x1, x2 and x4 belonged to the same cluster, so I picked up only x2.
>>>>>>> Based on the subject knowledge, I made a simple unweighted sum, by
>>>>>>> counting the number of clinical features. For 9 features (sex, side,
>>>>>>> HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the sum ranges
>>>>>>> from 0 to 9. This score was defined as ClinicalScore. Consequently, I
>>>>>>> made up new data set (x6.df), which consists of 5 variables
>>>>>>> (stenosis,
>>>>>>> x2, x3, procedure, and ClinicalScore) and one binary outcome
>>>>>>> (poor/good). Then, for alternative BMA session...
>>>>>>>
>>>>>>>> BMAx6.glm<- bic.glm(postopDWI_HI ~ ., data=x6.df,
>>>>>>> glm.family="binomial", OR=20, strict=FALSE)
>>>>>>>> summary(BMAx6.glm)
>>>>>>> (The output below has been cut off at the right edge to save space)
>>>>>>> Call:
>>>>>>> bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family =
>>>>>>> "binomial", strict = FALSE, OR = 20)
>>>>>>>
>>>>>>>
>>>>>>> 13 models were selected
>>>>>>> Best 5 models (cumulative posterior probability = 0.7626 ):
>>>>>>>
>>>>>>> p!=0 EV SD model 1 model 2
>>>>>>> Intercept 100 -5.6918362 1.81220 -4.4688 -6.3166
>>>>>>> stenosis 8.1 -0.0008417 0.00815 . .
>>>>>>> x2 100.0 3.0606165 0.87765 3.2221 3.1154
>>>>>>> x3 46.5 -0.3998864 0.52688 -0.9154 .
>>>>>>> procedure 49.3 0.5747013 0.70164 . 1.1631
>>>>>>> ClinicalScore 27.1 0.0966633 0.19645 . .
>>>>>>>
>>>>>>>
>>>>>>> nVar 2 2 1
>>>>>>> 3 3
>>>>>>> BIC -376.9082 -376.5588
>>>>>>> -376.3094 -375.8468 -375.5025
>>>>>>> post prob 0.208 0.175
>>>>>>> 0.154 0.122 0.103
>>>>>>>
>>>>>>> [Question 3]
>>>>>>> Am I doing it correctly or not?
>>>>>>> I mean this kind of model reduction is permissible for BMA?
>>>>>>>
>>>>>>> [Question 4]
>>>>>>> I still have 5 variables, which violates the rule of thumb, "EPV>
>>>>>>> 10".
>>>>>>> Is it permissible to delete "stenosis" variable because of small
>>>>>>> value
>>>>>>> of "EV"? Or is it O.K. because this is BMA?
>>>>>>>
>>>>>>> Sorry for long post.
>>>>>>>
>>>>>>> I appreciate your help very much in advance.
>>>>>>>
>>>>>>> --
>>>>>>> KH
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> -----
>>>>>> Frank Harrell
>>>>>> Department of Biostatistics, Vanderbilt University
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://r.789695.n4.nabble.com/BMA-logistic-regression-odds-ratio-model-reduction-etc-tp3462416p3462919.html
>>>>>>
>>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
> -----
> Frank Harrell
> Department of Biostatistics, Vanderbilt University
> --
> View this message in context: http://r.789695.n4.nabble.com/BMA-logistic-regression-odds-ratio-model-reduction-etc-tp3462416p3473354.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
*************************************************
　神戸大学大学院医学研究科 脳神経外科学分野
　細田 弘吉

　〒650-0017　神戸市中央区楠町7丁目5-1
     Phone: 078-382-5966
     Fax  : 078-382-5979
     E-mail address
         Office: khosoda at med.kobe-u.ac.jp
	Home  : khosoda at venus.dti.ne.jp
*************************************************
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x6factor.pentrace.pdf
Type: application/pdf
Size: 81745 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110426/6822935a/attachment.pdf>