[R] regression modeling

Tue Apr 25 22:47:12 CEST 2006

May I offer a perhaps contrary perspective on this?

Statistical **theory** tells us that the precision of estimates improves as
sample size increases. However, in practice, this is not always the case.
The reason is that it can take time to collect that extra data, and things
change over time. So the very definition of what one is measuring, the
measurement technology by which it is measured (think about estimating tumor
size or disease incidence or underemployment, for example), the presence or
absence of known or unknown large systematic effects, and so forth may
change in unknown ways. This defeats, or at least complicates, the
fundamental assumption that one is sampling from a (fixed) population or
stable (e.g. homogeneous, stationary) process, so it's no wonder that all
statistical bets are off. Of course, sometimes the necessary information to
account for these issues is present, and appropriate (but often complex)
statistical analyses can be performed. But not always.
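
As a minimal illustration of this point, suppose (purely as an assumption
made for the sketch, not something the data would tell you) that the
quantity being measured drifts linearly over time, and that one estimates
its current level by averaging the n most recent observations. The variance
of that average falls as 1/n, but the bias from including older, drifted
data grows with n, so the mean squared error eventually gets worse as more
data are added. The function name and the default values of sigma and drift
below are hypothetical, chosen only to make the effect visible.

```python
def mse_of_mean(n, sigma=1.0, drift=0.01):
    """Mean squared error of the average of the n most recent observations,
    used as an estimate of the *current* level of a process.

    Assumptions (hypothetical, for illustration only):
      - independent noise with standard deviation sigma
      - the true level moves by `drift` per observation (linear drift)
    """
    variance = sigma ** 2 / n        # shrinks as the sample grows
    bias = drift * (n - 1) / 2.0     # older observations lag the current level
    return variance + bias ** 2


# Error first falls with sample size, then rises as drift dominates.
for n in (1, 10, 30, 100, 1000):
    print(n, round(mse_of_mean(n), 4))

# Sample size with the smallest error under these assumptions.
best_n = min(range(1, 2001), key=mse_of_mean)
print("best n:", best_n)
```

With no drift (drift=0) the error falls steadily with n, as theory says. With
even a small drift there is a finite best sample size, and in practice one
rarely knows the form of the drift well enough to find it.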

Thus, I am suspicious, cynical even, about those who advocate collecting
"all the data" and subjecting the whole vast heterogeneous mess to arcane
and ever more computer intensive (and adjustable parameter ridden) "data
mining" algorithms to "detect trends" or "discover knowledge." To me, it
sounds like a prescription for "turning on all the equipment and waiting to
see what happens" in the science lab instead of performing careful,
well-designed experiments.

I realize, of course, that there are many perfectly legitimate areas of
scientific research, from geophysics to evolutionary biology to sociology,
where one cannot (easily) perform planned experiments. But my point is that
good science demands caution in all circumstances, and especially when one
accumulates and attempts to aggregate data taken over spans of time and
space: one needs to beware of oversimplification, including statistical
oversimplification. So interrogate the measurement, be skeptical of
stability, expect inconsistency. While "all models are wrong but some are
useful" (George Box), the second law of thermodynamics tells us that entropy
still rules.

(Needless to say, public or private contrary views are welcome.)

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
> Sent: Tuesday, April 25, 2006 12:10 PM
> To: bogdan romocea
> Cc: r-help
> Subject: Re: [R] regression modeling
> 
> i believe it is not a question only related to regression modeling. The
> correlation between the sample size and confidence of prediction in data
> mining is not as clear as traditional stat approach.  My concern is not in
> that theoretical discussion but more practical, looking for a good algorithm
> when response variable is continuous when large dataset is concerned.
> 
> On 4/25/06, bogdan romocea <br44114 at gmail.com> wrote:
> >
> > There is an aspect, worthy of careful consideration, you don't seem to
> > be aware of. I'll ask the question for you: How does the
> > explanatory/predictive potential of a dataset vary as the dataset gets
> > larger and larger?
> >
> >
> > > -----Original Message-----
> > > From: r-help-bounces at stat.math.ethz.ch
> > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
> > > Sent: Monday, April 24, 2006 12:45 PM
> > > To: r-help
> > > Subject: [R] regression modeling
> > >
> > > Hi, there:
> > > I am looking for a regression modeling (like regression trees) approach
> > > for a large-scale industry dataset. Any suggestion on a package from R
> > > or from other sources which has a decent accuracy and scalability? Any
> > > recommendation from experience is highly appreciated.
> > >
> > > Thanks,
> > >
> > > Weiwei
> > >
> > > --
> > > Weiwei Shi, Ph.D
> > >
> > > "Did you always know?"
> > > "No, I did not. But I believed..."
> > > ---Matrix III
> > >
> > >       [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
> > >
> >
> 