[R] Off Topic: Statistical "philosophy" rant

Berton Gunter gunter.berton at gene.com
Thu Jan 13 00:29:01 CET 2005


R-Listers.

The following is a rant originally sent privately to Frank Harrell in
response to remarks he made on this list. The ideas are not new or original,
but he suggested I share it with the list, as he felt it might nonetheless
be of wider interest. I have real doubts about this, and I apologize in
advance to those who agree that I should have kept my remarks private. In
view of this, if you wish to criticize my remarks on list, that's fine, but
I won't respond (I've said enough already!). I would be happy to discuss the
issues (a little) further off list with anyone who cares to, but not on
list.

Also, Frank sent me a relevant reference for those who might wish to read a
more thoughtful consideration of the issues:

@ARTICLE{far92cos,
  author  = {Faraway, J. J.},
  year    = {1992},
  title   = {The cost of data analysis},
  journal = {Journal of Computational and Graphical Statistics},
  volume  = {1},
  pages   = {213--229},
  annote  = {bootstrap; validation; predictive accuracy; modeling strategy;
             regression diagnostics; model uncertainty}
}

I welcome further relevant references, pro or con!

Finally, I need to emphasize that these are clearly my very personal views
and do not reflect those of my company or colleagues. 

Cheers to all ...
-----------

The relevant portion of Frank's original comment was in a thread about
Kolmogorov-Smirnov (K-S) tests for the goodness of fit of a parametric
distribution:

...
> If you use the empirical CDF to select a parametric distribution, the
> final estimate of the distribution will inherit the variance of the ECDF.
> The main reason statisticians think that parametric curve fits are far
> more efficient than nonparametric ones is that they don't account for
> model uncertainty in their final confidence intervals.
>
> -- Frank Harrell
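
To make Frank's point concrete before I get to my reply, here is a toy
simulation in R. Everything in it is an illustrative assumption of my own,
not anything Frank proposed: the data really are lognormal, the candidate
families are normal and lognormal, the family is chosen by whichever fitted
CDF lies closer to the ECDF in K-S distance, and the quantity of interest is
the 99th percentile.

set.seed(1)

n      <- 40      # sample size for each simulated "study"
nsim   <- 2000    # number of simulated studies
p      <- 0.99    # upper-tail quantile we pretend to care about
true.q <- qlnorm(p, meanlog = 0, sdlog = 0.5)   # data are truly lognormal

chosen <- character(nsim)
est    <- numeric(nsim)

for (i in seq_len(nsim)) {
    x <- rlnorm(n, meanlog = 0, sdlog = 0.5)

    ## K-S distance of each fitted candidate CDF from the ECDF
    ## (the statistic is used only as an informal distance, not as a test)
    d.norm  <- ks.test(x, "pnorm",  mean(x),      sd(x))$statistic
    d.lnorm <- ks.test(x, "plnorm", mean(log(x)), sd(log(x)))$statistic

    if (d.lnorm <= d.norm) {
        chosen[i] <- "lognormal"
        est[i]    <- qlnorm(p, mean(log(x)), sd(log(x)))
    } else {
        chosen[i] <- "normal"
        est[i]    <- qnorm(p, mean(x), sd(x))
    }
}

table(chosen)                # the "chosen model" is itself a random variable
tapply(est, chosen, mean)    # and the two branches give rather different answers
c(truth = true.q, sd.of.final.estimate = sd(est))

The final estimate is a function of the ECDF (through both the selection and
the fit), so it inherits the ECDF's variability; an analysis that conditions
on the chosen family simply pretends the selection step never happened.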

My reply:

That's a perceptive remark, but I would go further... You mentioned
**model** uncertainty. In fact, in any data analysis in which we explore
the data first to choose a model, fit the model (parametric or
nonparametric), and then use whatever machinery we like (pivots from
parametric theory, bootstrapping, ...) to say something about "model
uncertainty," we are always kidding ourselves and our colleagues, because
we fail to take into account the considerable variability introduced by
our initial subjective exploration and our subsequent choice of modeling
strategy. One can say, at best, that the stated model uncertainty is an
underestimate of the true uncertainty, and very likely a considerable
one, given the subjectivity of the model choice.
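
To put a number on that, here is a continuation of the toy sketch above
(same illustrative assumptions throughout, none of them anyone's recommended
practice): having let the ECDF pick the family, we report a 95% parametric
bootstrap interval for the 99th percentile as if that family had been fixed
in advance, and then ask how often such intervals actually cover the truth.

set.seed(2)

n      <- 40
nsim   <- 1000    # simulated "studies"
B      <- 300     # parametric bootstrap replicates within each study
p      <- 0.99
true.q <- qlnorm(p, meanlog = 0, sdlog = 0.5)

covered <- logical(nsim)
for (i in seq_len(nsim)) {
    x <- rlnorm(n, meanlog = 0, sdlog = 0.5)

    ## step 1: "exploration" -- let the ECDF pick the family
    d.norm  <- ks.test(x, "pnorm",  mean(x),      sd(x))$statistic
    d.lnorm <- ks.test(x, "plnorm", mean(log(x)), sd(log(x)))$statistic

    ## step 2: a 95% parametric-bootstrap percentile interval for the
    ## 0.99 quantile, computed as if the chosen family were fixed in advance
    if (d.lnorm <= d.norm) {
        boot <- replicate(B, {
            y <- rlnorm(n, mean(log(x)), sd(log(x)))
            qlnorm(p, mean(log(y)), sd(log(y)))
        })
    } else {
        boot <- replicate(B, {
            y <- rnorm(n, mean(x), sd(x))
            qnorm(p, mean(y), sd(y))
        })
    }
    ci <- quantile(boot, c(0.025, 0.975))
    covered[i] <- (ci[1] <= true.q) && (true.q <= ci[2])
}

mean(covered)   # realized coverage of the nominal 95% interval; in runs of
                # this sketch it comes out well below 0.95

The shortfall is exactly the unaccounted-for model and selection
uncertainty, and with a real analyst choosing by eye among many more
candidate strategies, the understatement is presumably far worse than in
this two-candidate toy.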

Now I in no way wish to discourage or abridge data exploration; only to
point out that we statisticians have promulgated a self-serving and
unrealistic view of the value of formal inference in quantifying true
scientific uncertainty when we do such exploration -- and that there is
therefore something fundamentally contradictory in our own rhetoric and
methods. Taking a larger view, I think this remark is part of the deeper
epistemological issue of characterizing what can be scientifically "known"
or, indeed, defining the difference between science and art, say. My own
view is that scientific certainty is a fruitless concept: we build models
that we benchmark against our subjective measurements of "reality"
(subjective because the measurements themselves depend on earlier
scientific models). Insofar as
data can limit or support our flights of modeling fancy, they do; but in the
end, it is neither an objective process nor one whose "uncertainty" can be
strictly quantified. In creating the illusion that "statistical methods" can
overcome these limitations, I think we have both done science a disservice
and relegated ourselves to an isolated, fringe role in scientific inquiry.

Needless to say, opposing viewpoints to such iconoclastic remarks are
cheerfully welcomed.

Best regards,

Bert Gunter



