[R] [Fwd: Re: Coefficients of Logistic Regression from bootstrap - how to get them?]

Michal Figurski figurski at mail.med.upenn.edu
Thu Jul 24 16:55:11 CEST 2008


Thank you, Frank and all, for your advice.

Here I attach the raw data from Pawinski's paper. I have obtained
permission from the corresponding author to post it here for everyone.
The only condition of use is that the authors retain ownership of the
data, and any publication resulting from these data must be managed by them.

The dataset is composed as follows: patient number / MMF dose in [g] /
day of study (since the start of drug administration) / MPA concentrations
[mg/L] in plasma at the following time points: 0, 0.5 ... 12 hours / and the
value of AUC(0-12h) calculated using all time-points.

The goal of the analysis, as you can read in the paper, was to
estimate the value of AUC using at most 3 time-points within 2 hours
post dose, that is, using only 3 of the 4 time-points 0, 0.5, 1, and
2 h, always including the "0" time-point.
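For concreteness, the idea can be sketched as follows. This is not the authors' code, and the concentration values are invented; it just shows a reference AUC(0-12) by the linear trapezoidal rule next to the partial AUC over the early (0-2 h) window that a limited-sampling model would start from:

```python
# Illustrative sketch only (hypothetical MPA values, not the study data).

def auc_trapezoid(times, concs):
    """Linear trapezoidal AUC over the sampled interval."""
    return sum(
        (t2 - t1) * (c1 + c2) / 2.0
        for (t1, c1), (t2, c2) in zip(zip(times, concs), zip(times[1:], concs[1:]))
    )

# Hypothetical MPA profile [mg/L] at 0, 0.5, 1, 2, 4, 6, 8, 12 h post dose
times = [0, 0.5, 1, 2, 4, 6, 8, 12]
concs = [1.5, 8.0, 12.0, 6.0, 3.5, 2.5, 2.0, 1.5]

auc_full = auc_trapezoid(times, concs)           # "reference" AUC(0-12)
auc_early = auc_trapezoid(times[:4], concs[:4])  # only the 0-2 h time-points
print(auc_full, auc_early)  # -> 43.375 16.375
```

A limited-sampling model then regresses the reference AUC(0-12) on the early concentrations, so the full profile is needed only to build the model, not to apply it.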

In my analysis of a similar problem I was also concerned about the fact
that the data come from several visits of a single patient. I examined
the effect of "PT" with repeated "day" using a mixed-effects model, and
these effects turned out to be non-significant. Do you think that is
enough justification to treat the dataset as if it came from 50 separate
patients?
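As a rough illustration of why this clustering question matters (this is not the mixed-effects check itself, just a one-way intraclass correlation on invented, balanced AUC values):

```python
# Illustrative sketch with hypothetical numbers.  A near-zero ICC would be
# the kind of evidence that pooling profiles across visits may be
# defensible; a mixed-effects model is the formal check.
from statistics import mean

# Hypothetical AUCs: patient -> AUC at each of 3 visits
aucs = {
    "pt1": [40.0, 42.0, 41.0],
    "pt2": [55.0, 53.0, 57.0],
    "pt3": [35.0, 36.0, 34.0],
}

grand = mean(v for vals in aucs.values() for v in vals)
k = 3              # visits per patient (balanced layout)
n = len(aucs)      # number of patients

# One-way ANOVA mean squares
ss_between = sum(k * (mean(vals) - grand) ** 2 for vals in aucs.values())
ss_within = sum((x - mean(vals)) ** 2 for vals in aucs.values() for x in vals)
ms_between = ss_between / (n - 1)
ms_within = ss_within / (n * (k - 1))

# ICC(1): share of total variance attributable to patient
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(round(icc, 5))  # -> 0.98125
```

In these invented numbers the ICC is high, which would argue *against* pooling; a non-significant patient effect on the real data is evidence in the other direction, though with 21 patients the test may have limited power.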

Also, as to the estimation of bias, variance, etc., Pawinski used CI and
Sy/x. In my analysis I additionally used RMSE values. Please excuse
another naive question, but do you think this is sufficient information
to compare models and account for bias?
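A small sketch of these metrics, on invented observed/predicted pairs. One point worth noting: RMSE mixes bias and imprecision (RMSE^2 = bias^2 + variance of the errors), so reporting the mean error separately is informative rather than redundant:

```python
# Illustrative sketch only (hypothetical observed/predicted AUC pairs).
from math import sqrt

observed  = [40.0, 55.0, 35.0, 60.0, 48.0]
predicted = [42.0, 50.0, 37.0, 63.0, 45.0]

errors = [p - o for o, p in zip(observed, predicted)]
bias = sum(errors) / len(errors)                       # mean prediction error
rmse = sqrt(sum(e * e for e in errors) / len(errors))  # overall error magnitude

# Agreement criterion: fraction of predictions within 15% of the reference
within_15 = sum(abs(p - o) / o <= 0.15 for o, p in zip(observed, predicted))
print(round(bias, 3), round(rmse, 3), within_15)  # -> -0.2 3.194 5
```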

Regarding the "multiple stepwise regression": according to the cited
SPSS manual, there are 5 options to select from. I don't think they used
the 'stepwise selection' option, because their models were already
pre-defined. Variables were pre-selected based on knowledge of the
pharmacokinetics of this drug and other factors. I think I understand
this part pretty well.

I see Frank's point about recalibration in Fig. 2 - although the
expectation was set that the prediction be within 15% of the original
value. In my opinion that is *very strict* - I actually used 20% in my
work. This is because of the very high variability and imprecision in
the results themselves. These are real biological data, and one has to
account for errors such as analytical errors (HPLC method), timing
errors and so on when looking at them. In other words, if you take two
blood samples at each time-point from a particular patient and run
them, you will certainly get two distinct (although similar)
profiles. You will get an even bigger difference if you run one set of
samples on one day and the other set on another day.

Therefore the value of AUC(0-12) itself, to which we compare the
predicted AUC, is not 'holy' - some variability here is inherent.

Nevertheless, I see that Fig. 2 may be incorrect from an orthodox
statistical perspective. I used the same kind of plots in my own work -
it's too late to change that now. How should I properly estimate the
R^2 then?
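One option, sketched below with invented numbers, is to compute R^2 around the line of identity (residuals = predicted - observed) rather than the squared Pearson correlation, which silently refits a new slope and intercept in the test data:

```python
# Illustrative sketch: hypothetical pairs that are perfectly correlated
# but systematically offset by +10, i.e. badly calibrated.

observed  = [40.0, 50.0, 60.0, 70.0]
predicted = [50.0, 60.0, 70.0, 80.0]

n = len(observed)
mean_obs = sum(observed) / n
ss_tot = sum((o - mean_obs) ** 2 for o in observed)

# R^2 around the line of identity: residuals taken as observed - predicted
ss_identity = sum((o - p) ** 2 for o, p in zip(observed, predicted))
r2_identity = 1 - ss_identity / ss_tot

# Squared correlation = r^2 after an implicit refit of slope and intercept
mean_pred = sum(predicted) / n
cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
var_p = sum((p - mean_pred) ** 2 for p in predicted)
r2_refit = cov ** 2 / (ss_tot * var_p)

print(round(r2_identity, 3), round(r2_refit, 3))  # -> 0.2 1.0
```

The refitted r^2 of 1.0 hides the 10-unit bias that the identity-line R^2 of 0.2 exposes, which is exactly the "after-the-fact recalibration" concern.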

I greatly appreciate your time and advice in this matter.

--
Michal J. Figurski

Frank E Harrell Jr wrote:
> Gustaf Rydevik wrote:
>> On Wed, Jul 23, 2008 at 4:08 PM, Michal Figurski
>> <figurski at mail.med.upenn.edu> wrote:
>>
>> Hi,
>>
>> I believe that you misunderstand the passage. Do you know what
>> multiple stepwise regression is?
>>
>> Since they used SPSS, I copied from
>> http://www.visualstatistics.net/SPSS%20workbook/stepwise_multiple_regression.htm 
>>
>>
>> "Stepwise selection is a combination of forward and backward procedures.
>> Step 1
>>
>> The first predictor variable is selected in the same way as in forward
>> selection. If the probability associated with the test of significance
>> is less than or equal to the default .05, the predictor variable with
>> the largest correlation with the criterion variable enters the
>> equation first.
>>
>>
>> Step 2
>>
>> The second variable is selected based on the highest partial
>> correlation. If it can pass the entry requirement (PIN=.05), it also
>> enters the equation.
>>
>> Step 3
>>
>> From this point, stepwise selection differs from forward selection:
>> the variables already in the equation are examined for removal
>> according to the removal criterion (POUT=.10) as in backward
>> elimination.
>>
>> Step 4
>>
>> Variables not in the equation are examined for entry. Variable
>> selection ends when no more variables meet entry and removal criteria.
>> -----------
>>
>>
>> It is the outcome of this *entire process*, steps 1-4, that they compare
>> with the outcome of their *entire bootstrap/cross-validation/selection
>> process*, steps 1-4 in the methods section, and find that their approach
>> gives better results.
>> What you are doing is only step 4 in the article's method
>> section, estimating the parameters of a model *when you already know
>> which variables to include*. It is the way this step is conducted that
>> I am sceptical about.
>>
>> Regards,
>>
>> Gustaf
>>
> 
> Perfectly stated Gustaf.  This is a great example of needing to truly 
> understand a method to be able to use it in the right context.
> 
> After having read most of the paper by Pawinski et al now, there are 
> other problems.
> 
> 1. The paper nowhere uses bootstrapping.  It uses repeated 2-fold 
> cross-validation, a procedure not usually recommended.
> 
> 2. The resampling procedure used in the paper treated the 50 
> pharmacokinetic profiles on 21 renal transplant patients as if these 
> were from 50 patients.  The cluster bootstrap should have been used 
> instead.
> 
> 3. Figure 2 showed the fitted regression line to the predicted vs. 
> observed AUCs.  It should have shown the line of identity instead.  In 
> other words, the authors allowed a subtle recalibration to creep into 
> the analysis (and inverted the x- and y-variables in the plots).  The 
> fitted lines are far enough away from the line of identity as to show 
> that the predicted values are not well calibrated.  The r^2 values 
> claimed by the authors used the wrong formulas which allowed an 
> automatic after-the-fact recalibration (new overall slope and intercept 
> are estimated in the test dataset).  Hence the achieved r^2 are misleading.
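Frank's point 2, the cluster bootstrap, can be sketched like this (the patient/profile layout below is invented, not the study's 21 patients): resample patients, not profiles, so each replicate keeps a patient's repeated profiles together.

```python
# Illustrative sketch of a cluster bootstrap replicate (hypothetical layout).
import random

profiles_by_patient = {
    "pt1": ["pt1_day1", "pt1_day14"],
    "pt2": ["pt2_day1"],
    "pt3": ["pt3_day1", "pt3_day14", "pt3_day28"],
}

def cluster_bootstrap_sample(clusters, rng):
    """One replicate: sample patient IDs with replacement, then take
    every profile of each sampled patient as a block."""
    ids = list(clusters)
    sampled = [rng.choice(ids) for _ in ids]
    return [prof for pid in sampled for prof in clusters[pid]]

rng = random.Random(0)
replicate = cluster_bootstrap_sample(profiles_by_patient, rng)
print(replicate)
```

Resampling the 50 profiles directly would treat correlated repeated measurements as independent, understating the variability of the resulting estimates.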
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Dataset.csv
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20080724/f8ce0b2b/attachment.pl>


More information about the R-help mailing list