[R] sample size > 20K? Was: fitness of regression tree: how to measure???

Thu Apr 1 18:53:02 CEST 2010

Since Frank has made this somewhat cryptic remark (sample size > 20K)
several times now, perhaps I can add a few words of (what I hope is) further
clarification.

Despite any claims to the contrary, **all** statistical (i.e. empirical)
modeling procedures are just data interpolators: that is, all that they can
claim to do is produce reasonable predictions of what may be expected within
the extent of the data. The quality of the model is judged by the goodness
of fit/prediction over this extent. Hence the standard textbook caveats about
the dangers of extrapolation when using fitted models for prediction. Note,
btw, the contrast to "mechanistic" models, which typically **are** assessed
by how well they **extrapolate** beyond current data: for example, Newton's
gravitation, from the apple to the planets. They are often "validated" by
their ability to "work" in circumstances (or scales) very different from
those from which they were derived.
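
A small illustration of the interpolation/extrapolation contrast, as a sketch
only. Everything in it is an assumption made for the example: the "true"
relationship is taken to be y = x^2, the empirical model is an ordinary
least-squares straight line, and the names fit_line and truth are
hypothetical. It is written in plain Python, not R.

```python
def truth(x):
    """Assumed 'true' relationship, chosen only for illustration."""
    return x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Observed data cover only the interval [0, 1].
xs = [i / 100 for i in range(101)]
ys = [truth(x) for x in xs]
slope, intercept = fit_line(xs, ys)

# Worst error inside the extent of the data.
max_in_range_error = max(abs(slope * x + intercept - truth(x)) for x in xs)

# Error when the same fitted line is used far outside that extent.
extrapolation_error = abs(slope * 10.0 + intercept - truth(10.0))

print("max error within [0, 1]:", round(max_in_range_error, 3))
print("error at x = 10        :", round(extrapolation_error, 3))
```

Within the observed range the wrong-but-simple line is a serviceable
predictor; at x = 10 it is off by roughly the size of the response itself.
Nothing in the fit on [0, 1] warns of that.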

So statistical models are just fancy "prediction engines." In particular,
there is no guarantee that they provide any meaningful assessment of
variable importance: how predictors causally relate to the response.
Obviously, empirical modeling can often be useful for this purpose,
especially in well-designed studies and experiments, but there's no
guarantee: it's an "accidental" byproduct of effective prediction.

This is particularly true for happenstance (un-designed) data and
non-parametric models like regression/classification trees. Typically, there
are many alternative models (trees) that give essentially the same quality
of prediction. You can see this empirically by removing a modest random
subset of the data and re-fitting. You should not be surprised to see the
fitted model -- the tree topology -- change quite radically. HOWEVER, the
predictions of the models within the extent of the data will be quite
similar to the original results. Frank's point is that unless the data set
is quite large and the predictive relationships quite strong -- which
usually implies parsimony -- this is exactly what one should expect. Thus it
is critical not to over-interpret the particular model one gets, i.e. to
infer causality from the model (tree) structure.
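
The refitting experiment described above can be sketched on a toy scale. This
is not rpart: it is a hand-rolled one-split regression tree (a "stump") in
plain Python, and the nine-row data set, the two correlated predictors x1 and
x2, and the helper names (sse, fit_stump, predict, rmse) are all made up for
the illustration. The data were constructed so that dropping a single row
changes which variable the tree splits on.

```python
import math


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_stump(rows):
    """Exhaustive search for the single best split over both predictors.

    rows: list of (x1, x2, y).
    Returns (variable_index, threshold, left_mean, right_mean).
    """
    best = None
    for j in (0, 1):
        xs = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(xs, xs[1:]):
            t = (lo + hi) / 2
            left = [r[2] for r in rows if r[j] <= t]
            right = [r[2] for r in rows if r[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, t,
                        sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def predict(stump, row):
    j, t, left_mean, right_mean = stump
    return left_mean if row[j] <= t else right_mean


def rmse(stump, rows):
    """Root mean squared prediction error of a stump on the given rows."""
    return math.sqrt(
        sum((predict(stump, r) - r[2]) ** 2 for r in rows) / len(rows))


# Hypothetical data: (x1, x2, y). x1 and x2 rank the rows almost identically.
DATA = [
    (1.0, 1.0, 0.0),
    (2.0, 7.0, 5.3),
    (3.0, 2.0, 0.0),
    (4.0, 3.0, 0.0),
    (4.5, 3.5, 4.0),
    (6.0, 5.0, 10.0),
    (7.0, 6.0, 10.0),
    (8.0, 8.0, 10.0),
    (9.0, 9.0, 10.0),
]

full = fit_stump(DATA)                    # fit on all nine rows
refit = fit_stump(DATA[:4] + DATA[5:])    # refit with the fifth row removed

for label, stump in (("all rows", full), ("one row dropped", refit)):
    print("%-16s split on x%d at %.2f, RMSE on all rows %.2f"
          % (label, stump[0] + 1, stump[1], rmse(stump, DATA)))
```

On these made-up numbers the full-data stump splits on x1 and the refit splits
on x2, yet the two have similar error over the whole data set (an RMSE of
about 1.7 versus about 1.9 on a response that ranges from 0 to 10). Reading
anything causal into "x1 matters" versus "x2 matters" here would be
over-interpretation of exactly the kind warned against above.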

Incidentally, there is nothing new or radical in this; indeed, John Tukey,
Leo Breiman, George Box, and others wrote eloquently about this decades ago.
And Breiman's random forest modeling procedure explicitly abandoned efforts
to build simply interpretable models (from which one might infer causality)
in favor of building better interpolators, although assessment of "variable
importance" does try to recover some of that interpretability (however, no
guarantees are given).

HTH. And contrary views welcome, as always.

Cheers to all,

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Frank E Harrell Jr
Sent: Thursday, April 01, 2010 5:02 AM
To: vibha patel
Cc: r-help at r-project.org
Subject: Re: [R] fitness of regression tree: how to measure???

vibha patel wrote:
> Hello,
> 
> I'm using rpart function for creating regression trees.
> now how to measure the fitness of regression tree???
> 
> thanks n Regards,
> Vibha

If the sample size is less than 20,000, assume that the tree is a 
somewhat arbitrary representation of the relationships in the data and 
that the form of the tree will not replicate in future datasets.

Frank

-- 
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University