[R] Coefficients of Logistic Regression from bootstrap - how to get them?

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Jul 24 01:11:42 CEST 2008


Gustaf Rydevik wrote:
> On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
> <figurski at mail.med.upenn.edu> wrote:
>> Gustaf,
>>
>> I am sorry, but I don't get the point. Let's just focus on predictive
>> performance from the cited passage, that is the number of values predicted
>> within 15% of the original value.
>> So, the predictive performance from the model fit on entire dataset was 56%
>> of profiles, while from bootstrapped model it was 82% of profiles. Well - I
>> see a stunning purpose in the bootstrap step here: it turns a useless
>> equation into a clinically applicable model!
>>
>> Honestly, I also can't see how this can be better than fitting on entire
>> dataset, but here you have a proof that it is.
>>
>> I think that another argument supporting this approach is model validation.
>> If you fit model on entire data, you have no data left to validate its
>> predictions.
>>
>> On the other hand, I agree with you that the passage in methods section
>> looks awkward.
>>
>> In my work on a similar problem, that is going to appear in August in Ther
>> Drug Monit, I used medians since beginning and all the comparisons were done
>> based on models with median coefficients. I think this is what the authors
>> of that paper did, though they might just have had a problem with describing
>> it correctly, and unfortunately it passed through review process unchanged.
>>
> 
> 
> 
> Hi,
> 
> I believe that you misunderstand the passage. Do you know what
> multiple stepwise regression is?
> 
> Since they used SPSS, I copied from
> http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm
> 
> "Stepwise selection is a combination of forward and backward procedures.
> Step 1
> 
> The first predictor variable is selected in the same way as in forward
> selection. If the probability associated with the test of significance
> is less than or equal to the default .05, the predictor variable with
> the largest correlation with the criterion variable enters the
> equation first.
> 
> 
> Step 2
> 
> The second variable is selected based on the highest partial
> correlation. If it can pass the entry requirement (PIN=.05), it also
> enters the equation.
> 
> Step 3
> 
> From this point, stepwise selection differs from forward selection:
> the variables already in the equation are examined for removal
> according to the removal criterion (POUT=.10) as in backward
> elimination.
> 
> Step 4
> 
> Variables not in the equation are examined for entry. Variable
> selection ends when no more variables meet entry and removal criteria.
> -----------
> 
> 
> It is the outcome of this *entire process*, steps 1-4, that they compare
> with the outcome of their *entire bootstrap/crossvalidation/selection
> process*, steps 1-4 in the methods section, and find that their approach
> gives better results.
> What you are doing is only step 4 in the article's methods
> section, estimating the parameters of a model *when you already know
> which variables to include*. It is the way this step is conducted that
> I am sceptical about.
> 
> Regards,
> 
> Gustaf
> 

Perfectly stated Gustaf.  This is a great example of needing to truly 
understand a method to be able to use it in the right context.

Having now read most of the paper by Pawinski et al, I see other
problems.

1. The paper nowhere uses bootstrapping.  It uses repeated 2-fold 
cross-validation, a procedure not usually recommended.

2. The resampling procedure used in the paper treated the 50 
pharmacokinetic profiles on 21 renal transplant patients as if these 
were from 50 patients.  The cluster bootstrap should have been used instead.

3. Figure 2 showed the regression line fitted to the predicted vs.
observed AUCs.  It should have shown the line of identity instead.  In
other words, the authors allowed a subtle recalibration to creep into
the analysis (and inverted the x- and y-variables in the plots).  The
fitted lines are far enough from the line of identity to show
that the predicted values are not well calibrated.  The r^2 values
reported by the authors were computed with the wrong formulas, which
allow an automatic after-the-fact recalibration (a new overall slope and
intercept are estimated in the test dataset).  Hence the reported r^2
values are misleading.
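
To illustrate point 2, here is a minimal sketch of a cluster bootstrap
draw, written in Python with only the standard library.  It is not the
paper's procedure and not any package's API: the function name
cluster_bootstrap_sample, the record layout, and the split of 50 profiles
over 21 patients (8 patients with 3 profiles, 13 with 2) are all
assumptions made up for the example.  The idea is only that patients, not
profiles, are resampled with replacement, and every profile of a drawn
patient comes along.

```python
import random
from collections import defaultdict


def cluster_bootstrap_sample(records, cluster_key, rng):
    """Draw one bootstrap sample by resampling whole clusters with replacement."""
    # Group the records (profiles) by their cluster (patient).
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[cluster_key]].append(rec)
    ids = sorted(clusters)
    # Draw as many clusters as there are distinct clusters, with replacement.
    drawn = [rng.choice(ids) for _ in ids]
    # Every record of a drawn cluster enters the sample, once per draw.
    sample = [rec for cid in drawn for rec in clusters[cid]]
    return drawn, sample


# Synthetic layout only (an assumption, not the paper's data):
# 8 patients with 3 profiles and 13 patients with 2 profiles = 50 profiles.
profiles_per_patient = [3] * 8 + [2] * 13
records = [
    {"patient": p, "profile": k}
    for p, n in enumerate(profiles_per_patient, start=1)
    for k in range(1, n + 1)
]

rng = random.Random(2008)  # fixed seed so the draw is reproducible
drawn, sample = cluster_bootstrap_sample(records, "patient", rng)
print(len(drawn), "patients drawn;", len(sample), "profiles in the resample")
```

The resample always contains 21 patient draws, but the number of
profiles varies from draw to draw, because patients contribute different
numbers of profiles.  Treating the 50 profiles as independent units would
instead split one patient's profiles between the resample and the
left-out data.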
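
The r^2 issue in point 3 can be sketched numerically.  The Python snippet
below (standard library only) compares two quantities on made-up numbers:
the squared correlation between observed and predicted values, which
equals the r^2 obtained after refitting a slope and intercept in the test
data, and an R^2 measured against the line of identity, computed as
1 - SSE/SST with the predictions used as they stand.  The function names
and the data are assumptions for illustration, and 1 - SSE/SST is one
common choice rather than necessarily the formula anyone in this thread
had in mind.

```python
def mean(xs):
    return sum(xs) / len(xs)


def r2_recalibrated(observed, predicted):
    """Squared Pearson correlation: r^2 after refitting slope and intercept."""
    mo, mp = mean(observed), mean(predicted)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((p - mp) ** 2 for p in predicted)
    syy = sum((o - mo) ** 2 for o in observed)
    return sxy * sxy / (sxx * syy)


def r2_identity(observed, predicted):
    """R^2 against the line of identity: predictions are used unchanged."""
    mo = mean(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mo) ** 2 for o in observed)
    return 1 - sse / sst


# Made-up values: predictions are perfectly correlated with the
# observations but miscalibrated (slope 0.5, intercept 25).
observed = [20, 30, 40, 50, 60, 70, 80]
predicted = [0.5 * o + 25 for o in observed]

print("r^2 with recalibration:  ", r2_recalibrated(observed, predicted))
print("R^2 vs. line of identity:", r2_identity(observed, predicted))
```

On these synthetic numbers the first quantity is 1 and the second is
0.75: the refit of slope and intercept hides the miscalibration that the
identity-line version exposes.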


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University


