[R] FW: logistic regression

Frank E Harrell Jr f.harrell at vanderbilt.edu
Mon Sep 29 13:42:41 CEST 2008

Gavin Simpson wrote:
> On Sun, 2008-09-28 at 21:23 -0500, Frank E Harrell Jr wrote:
>> Darin Brooks wrote:
>>> I certainly appreciate your comments, Bert.  It is abundantly clear that I
> <snip />
>>> Darin Brooks   
>> Darin,
>> I think the point is that the confidence you can assign to the "best 
>> available variables" is zero: that is the probability that stepwise 
>> variable selection will select the correct set of variables.
>> It is probably better to build a model based on the knowledge in the 
>> field you alluded to, rather than to use P-values to decide.
>> Frank Harrell
> Hi Frank, et al
> I don't have Darin's original email to hand just now, but IIRC he turned
> on the testing by p-values, something that add1 and drop1 do not do by
> default.
> Venables and Ripley's MASS contains stepAIC and there they make use of
> drop1 in the regression chapters (Apologies if I have made sweeping
> statements that are just plain wrong here - I'm at home this morning and
> don't seem to have either of my two MASS copies here with me).
> Would the same criticisms made by yourself and Bert, amongst others, in
> this thread be levelled at simplifying models using AIC rather than via
> p-values? Part of the issue with stepwise procedures is that they don't
> correct the overall Type I error rate (even if you use 0.05 as your
> cut-off for each test, overall your error rate can be much larger). Does
> AIC allow one to get out of this bit of the problem with stepwise
> methods?

AIC is just a restatement of P-values, so using AIC one variable at a 
time is just like using a different alpha (for a single-parameter term, 
dropping it improves AIC exactly when the likelihood-ratio statistic is 
below 2, i.e., alpha is about 0.157).  Both methods have problems.
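That equivalence is easy to check numerically. The sketch below (simulated data, not from the original thread) fits a logistic model and compares `drop1()`'s AIC column with its 1-df likelihood-ratio p-values: AIC prefers dropping a term exactly when its p-value exceeds 1 - pchisq(2, 1), roughly 0.157.

```r
## Illustrative sketch: drop1() with AIC agrees with a 1-df
## likelihood-ratio test at the cutoff chisq = 2.
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 * x1))      # x2 is pure noise
fit <- glm(y ~ x1 + x2, family = binomial)

d <- drop1(fit, test = "Chisq")           # AIC and LRT side by side
print(d)

## For each term, AIC favours dropping it iff its LRT statistic < 2,
## equivalently iff its p-value > 1 - pchisq(2, 1)
alpha <- 1 - pchisq(2, 1)                 # about 0.157
stopifnot(all((d$AIC[-1] < d$AIC[1]) == (d[["Pr(>Chi)"]][-1] > alpha)))
```

The algebra behind the assertion: AIC(reduced) = AIC(full) + LRT - 2, so the reduced model wins on AIC precisely when LRT < 2.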


> I'd appreciate any thoughts you or others on the list may have on this.
> All the best, and thanks for an interesting discussion thus far.
> G
>>> -----Original Message-----
>>> From: Bert Gunter [mailto:gunter.berton at gene.com] 
>>> Sent: Sunday, September 28, 2008 6:26 PM
>>> To: 'David Winsemius'; 'Darin Brooks'
>>> Cc: r-help at stat.math.ethz.ch; ted.harding at manchester.ac.uk
>>> Subject: RE: [R] FW: logistic regression
>>> The Inferno awaits me -- but I cannot resist a comment (but DO look at
>>> Frank's website).
>>> There is a deep and disconcerting dissonance here. Scientists are
>>> (naturally) interested in getting at mechanisms, and so want to know which
>>> of the variables "count" and which do not. But statistical analysis --
>>> **any** statistical analysis -- cannot tell you that. All statistical
>>> analysis can do is build models that give good predictions (and only over
>>> the range of the data). The models you get depend **both** on the way Nature
>>> works **and** the peculiarities of your data (which is what Frank referred
>>> to in his comment on data reduction). In fact, it is highly likely that with
>>> your data there are many alternative prediction equations built from
>>> different collections of covariates that perform essentially equally well.
>>> Sometimes it is otherwise, typically when prospective, carefully designed
>>> studies are performed -- there is a reason that the FDA insists on clinical
>>> trials, after all (and reasons why such studies are difficult and expensive
>>> to do!).
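Bert's point that many alternative prediction equations can perform about equally well is easy to demonstrate with a small simulation (a hypothetical sketch, not part of the original exchange): run stepwise selection on bootstrap resamples of the same data set and watch the "best" variable set change from resample to resample.

```r
## Hypothetical sketch: stepwise selection on bootstrap resamples of
## one data set typically returns several different "best" subsets.
library(MASS)                       # for stepAIC()
set.seed(42)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(0.7 * X[, 1] - 0.7 * X[, 2]))  # only x1, x2 matter
dat <- data.frame(y, X)

selected <- replicate(20, {
  i    <- sample(n, replace = TRUE)                # bootstrap resample
  fit  <- glm(y ~ ., data = dat[i, ], family = binomial)
  best <- stepAIC(fit, trace = FALSE)
  paste(sort(names(coef(best))[-1]), collapse = "+")
})
table(selected)          # the selected sets differ across resamples
```

The instability of the selected set across resamples is exactly why the "confidence zero" remark above is not just rhetoric.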
>>> The belief that "data mining" (as it is known in the polite circles that
>>> Frank obviously eschews) is an effective (and even automated!) tool for
>>> discovering how Nature works is a misconception, but one that for many
>>> reasons is enthusiastically promoted.  If you are looking only to predict,
>>> it may do; but you are deceived if you hope for Truth. Can you get hints? --
>>> well maybe, maybe not. Chaos beckons.
>>> I think many -- maybe even most -- statisticians rue the day that stepwise
>>> regression was invented and certainly that it has been marketed as a tool
>>> for winnowing out the "important" few variables from the blizzard of
>>> "irrelevant" background noise. Pogo was right: "We have met the enemy
>>> and he is us."
>>> (As I said, the Inferno awaits...)
>>> Cheers to all,
>>> Bert Gunter
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>>> Behalf Of David Winsemius
>>> Sent: Saturday, September 27, 2008 5:34 PM
>>> To: Darin Brooks
>>> Cc: r-help at stat.math.ethz.ch; ted.harding at manchester.ac.uk
>>> Subject: Re: [R] FW: logistic regression
>>> It's more a statement that it expresses a statistical perspective very
>>> succinctly, somewhat like a Zen koan.  Frank's book, "Regression Modeling
>>> Strategies", has entire chapters on reasoned approaches to your question.
>>> His website also has quite a bit of material free for the taking.
>>> --
>>> David Winsemius
>>> Heritage Laboratories
>>> On Sep 27, 2008, at 7:24 PM, Darin Brooks wrote:
>>>> Glad you were amused.
>>>> I assume that "booking this as a fortune" means that this was an 
>>>> idiotic way to model the data?
>>>> MARS?  Boosted Regression Trees?  Any of these a better choice to 
>>>> extract significant predictors (from a list of about 44) for a 
>>>> measured dependent variable?
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org
>>>> [mailto:r-help-bounces at r-project.org] On Behalf Of Ted Harding
>>>> Sent: Saturday, September 27, 2008 4:30 PM
>>>> To: r-help at stat.math.ethz.ch
>>>> Subject: Re: [R] FW: logistic regression
>>>> On 27-Sep-08 21:45:23, Dieter Menne wrote:
>>>>> Frank E Harrell Jr <f.harrell <at> vanderbilt.edu> writes:
>>>>>> Estimates from this model (and especially standard errors and
>>>>>> P-values)
>>>>>> will be invalid because they do not take into account the stepwise 
>>>>>> procedure above that was used to torture the data until they 
>>>>>> confessed.
>>>>>> Frank
>>>>> Please book this as a fortune.
>>>>> Dieter
>>>> Seconded!
>>>> Ted.
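Frank's "torture the data until they confessed" remark can be made concrete with a small simulation (an illustrative sketch, not from the thread): run stepwise selection on a response that is pure noise, then print the naive summary of the surviving model. Any terms that survive carry p-values computed as if the model had been prespecified, so they overstate the evidence.

```r
## Illustrative sketch: stepwise selection applied to pure noise.
## The naive p-values of the surviving terms are computed as if the
## model had been prespecified, so they overstate the evidence.
library(MASS)                       # for stepAIC()
set.seed(7)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rnorm(n)                       # y is independent of every x
dat <- data.frame(y, X)

full <- lm(y ~ ., data = dat)
fit  <- stepAIC(full, trace = FALSE)
summary(fit)$coefficients           # naive inference after selection
```

Comparing these coefficients' standard errors and p-values with the fact that every predictor is noise is the whole point: the selection step is invisible to `summary()`.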

Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
