[R] regression modeling

Wed Apr 26 01:08:15 CEST 2006

Berton Gunter wrote:
> May I offer a perhaps contrary perspective on this.
> 
> Statistical **theory** tells us that the precision of estimates improves as
> sample size increases. However, in practice, this is not always the case.
> The reason is that it can take time to collect that extra data, and things
> change over time. So the very definition of what one is measuring, the
> measurement technology by which it is measured (think about estimating tumor
> size or disease incidence or underemployment, for example), the presence or
> absence of known or unknown large systematic effects, and so forth may
> change in unknown ways. This defeats, or at least complicates, the
> fundamental assumption that one is sampling from a (fixed) population or
> stable (e.g. homogeneous, stationary) process, so it's no wonder that all
> statistical bets are off. Of course, sometimes the necessary information to
> account for these issues is present, and appropriate (but often complex)
> statistical analyses can be performed. But not always.
> 
> Thus, I am suspicious, cynical even, about those who advocate collecting
> "all the data" and subjecting the whole vast heterogeneous mess to arcane
> and ever more computer intensive (and adjustable parameter ridden) "data
> mining" algorithms to "detect trends" or "discover knowledge." To me, it
> sounds like a prescription for "turning on all the equipment and waiting to
> see what happens" in the science lab instead of performing careful,
> well-designed experiments.
> 
> I realize, of course, that there are many perfectly legitimate areas of
> scientific research, from geophysics to evolutionary biology to sociology,
> where one cannot (easily) perform planned experiments. But my point is that
> good science demands that in all circumstances, and especially when one
> accumulates and attempts to aggregate data taken over spans of time and
> space, one needs to beware of oversimplification, including statistical
> oversimplification. So interrogate the measurement, be skeptical of
> stability, expect inconsistency. While "all models are wrong but some are
> useful" (George Box), the second law tells us that entropy still rules.
> 
> (Needless to say, public or private contrary views are welcome).
> 
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
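
A rough numerical sketch of the point about precision versus drift, under assumptions that are mine and not Bert's: suppose observation t equals a fixed initial level plus a small linear drift `delta * t` plus independent noise with standard deviation `sigma`, and that the sample mean is used to estimate the initial level. The function names and the values of `sigma` and `delta` are hypothetical, chosen only for illustration.

```python
import math

def standard_error(n, sigma=1.0):
    """Textbook standard error of a sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def drift_bias(n, delta=0.001):
    """Bias of the sample mean as an estimate of the initial level when
    observation t (t = 0, ..., n-1) has drifted by delta * t.
    The mean of 0, ..., n-1 is (n - 1) / 2."""
    return delta * (n - 1) / 2.0

def rmse(n, sigma=1.0, delta=0.001):
    """Root mean squared error: sqrt(variance + bias^2)."""
    return math.sqrt(standard_error(n, sigma) ** 2 + drift_bias(n, delta) ** 2)

# Hypothetical sample sizes, for illustration only.
for n in (10, 100, 1000, 10000):
    print(f"n={n:>6}  SE={standard_error(n):.4f}  "
          f"bias={drift_bias(n):.4f}  RMSE={rmse(n):.4f}")
```

Under these assumptions the standard error keeps shrinking as n grows, but the drift-induced bias grows with n, so beyond some sample size the overall error gets worse, not better.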

Bert raises some great points.  Setting aside the important issues of
doing good research and of stability in the meaning of data as time
marches on, it is generally true that the larger the sample size, the
greater the complexity of the model we can afford to fit, and the better
the fit of the model.  This is the "AIC" school.  The "BIC" school
assumes there is an actual model out there waiting for us, of finite
dimension, and that the complexity of our models should not grow very
fast as N increases.  I find the "AIC" approach gives me more accurate
predictions.
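
As a small sketch of the difference between the two schools, using the standard definitions AIC = -2 log L + 2k and BIC = -2 log L + k log(n), where k is the number of parameters and n the sample size: the function names and the sample sizes below are illustrative assumptions, not part of any particular package.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    """Bayesian (Schwarz) information criterion: -2 log L + k log(n)."""
    return -2.0 * log_likelihood + k * math.log(n)

def per_parameter_penalty(n):
    """Return (AIC penalty, BIC penalty) charged for one extra parameter."""
    return 2.0, math.log(n)

# Hypothetical sample sizes, for illustration only.
for n in (10, 100, 1000, 100000):
    a, b = per_parameter_penalty(n)
    print(f"n={n:>6}  AIC penalty per parameter={a:.2f}  "
          f"BIC penalty per parameter={b:.2f}")
```

The AIC charge for an extra parameter stays at 2 whatever the sample size, while the BIC charge is log(n) and so exceeds 2 once n is 8 or more and keeps growing.  That is why BIC lets model complexity grow more slowly as N increases.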
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University