[R] sample size > 20K? Was: fitness of regression tree: how to measure???
Ravi Varadhan
rvaradhan at jhmi.edu
Thu Apr 1 23:23:13 CEST 2010
The discussion of Leo Breiman's paper in Statistical Science, "Statistical Modeling: The Two Cultures," is a must-read for all statisticians doing prediction modeling. See especially the exchange between Cox and Breiman (I call this the Cox-Breiman duel).
Ravi.
____________________________________________________________________
Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University
Ph. (410) 502-2619
email: rvaradhan at jhmi.edu
----- Original Message -----
From: Bert Gunter <gunter.berton at gene.com>
Date: Thursday, April 1, 2010 12:55 pm
Subject: Re: [R] sample size > 20K? Was: fitness of regression tree: how to measure???
To: 'Frank E Harrell Jr' <f.harrell at vanderbilt.edu>, 'vibha patel' <vibhapatelddu at gmail.com>
Cc: r-help at r-project.org
> Since Frank has made this somewhat cryptic remark (sample size > 20K)
> several times now, perhaps I can add a few words of (what I hope is) further
> clarification.
>
> Despite any claims to the contrary, **all** statistical (i.e. empirical)
> modeling procedures are just data interpolators: that is, all that they can
> claim to do is produce reasonable predictions of what may be expected within
> the extent of the data. The quality of the model is judged by the goodness
> of fit/prediction over this extent. Ergo the standard textbook caveats about
> the dangers of extrapolation when using fitted models for prediction. Note,
> btw, the contrast to "mechanistic" models, which typically **are** assessed
> by how well they **extrapolate** beyond current data. For example, Newton's
> apple to the planets. They are often "validated" by their ability to "work"
> in circumstances (or scales) much different than those from which they were
> derived.
>
> So statistical models are just fancy "prediction engines." In particular,
> there is no guarantee that they provide any meaningful assessment of
> variable importance: how predictors causally relate to the response.
> Obviously, empirical modeling can often be useful for this purpose,
> especially in well-designed studies and experiments, but there's no
> guarantee: it's an "accidental" byproduct of effective prediction.
>
> This is particularly true for happenstance (un-designed) data and
> non-parametric models like regression/classification trees. Typically, there
> are many alternative models (trees) that give essentially the same quality
> of prediction. You can see this empirically by removing a modest random
> subset of the data and re-fitting. You should not be surprised to see the
> fitted model -- the tree topology -- change quite radically. HOWEVER, the
> predictions of the models within the extent of the data will be quite
> similar to the original results. Frank's point is that unless the data set
> is quite large and the predictive relationships quite strong -- which
> usually implies parsimony -- this is exactly what one should expect. Thus it
> is critical not to over-interpret the particular model one gets, i.e. to
> infer causality from the model (tree) structure.
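The refitting experiment described above can be sketched in R with rpart. This is only a rough illustration: the simulated data, the 10% drop rate, the 20 repetitions, and the helper split_vars() are all assumptions made for the example, and results on real data will differ.

```r
library(rpart)  # recommended package, normally installed with R

set.seed(1)
# Simulated "happenstance" data (an assumption for illustration): three
# strongly correlated predictors, so different splits can predict about
# equally well.
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- x1 + rnorm(n, sd = 0.3)
y  <- x1 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Reference tree fitted to all of the data
full_fit  <- rpart(y ~ ., data = dat)
full_pred <- predict(full_fit, newdata = dat)

# Hypothetical helper: the split variables of a fitted tree, in the order
# they appear in fit$frame, with terminal nodes dropped
split_vars <- function(fit) {
  v <- as.character(fit$frame$var)
  v[v != "<leaf>"]
}

# Refit 20 times, each time after dropping a random 10% of the rows
refits <- lapply(1:20, function(i) {
  keep <- sample(n, size = round(0.9 * n))
  fit  <- rpart(y ~ ., data = dat[keep, ])
  list(vars = paste(split_vars(fit), collapse = " > "),
       cor_with_full = cor(predict(fit, newdata = dat), full_pred))
})

# How many distinct split sequences (a crude proxy for topology) turned up?
topologies <- table(sapply(refits, `[[`, "vars"))
print(topologies)

# How closely do the refitted trees' predictions track the full-data tree?
pred_cor <- sapply(refits, `[[`, "cor_with_full")
print(summary(pred_cor))
```

Comparing the number of distinct split sequences with the spread of the prediction correlations gives a quick feel for how much the structure moves relative to the predictions.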
>
> Incidentally, there is nothing new or radical in this; indeed, John Tukey,
> Leo Breiman, George Box, and others wrote eloquently about this decades ago.
> And Breiman's random forest modeling procedure explicitly abandoned efforts
> to build simply interpretable models (from which one might infer causality)
> in favor of building better interpolators, although assessment of "variable
> importance" does try to recover some of that interpretability (however, no
> guarantees are given).
>
> HTH. And contrary views welcome, as always.
>
> Cheers to all,
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
> -----Original Message-----
> From: r-help-bounces at r-project.org On Behalf Of Frank E Harrell Jr
> Sent: Thursday, April 01, 2010 5:02 AM
> To: vibha patel
> Cc: r-help at r-project.org
> Subject: Re: [R] fitness of regression tree: how to measure???
>
> vibha patel wrote:
> > Hello,
> >
> > I'm using the rpart function to create regression trees.
> > How can I measure the fitness of a regression tree?
> >
> > thanks n Regards,
> > Vibha
>
> If the sample size is less than 20,000, assume that the tree is a
> somewhat arbitrary representation of the relationships in the data and
> that the form of the tree will not replicate in future datasets.
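On the original question of measuring fit, one common approach is to look at predictive accuracy rather than at the form of the tree. The sketch below uses the cross-validated relative error that rpart stores in the fitted object's cptable, plus a simple hold-out check. The simulated data, the 70/30 split, and the variable names are assumptions for illustration only.

```r
library(rpart)  # recommended package, normally installed with R

set.seed(2)
# Simulated data (an assumption for illustration)
n   <- 1000
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 2 * (dat$x1 > 0.5) + dat$x2 + rnorm(n, sd = 0.5)

# Hold out 30% of the rows for testing
train_rows <- sample(n, size = 700)
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

# method = "anova" fits a regression tree
fit <- rpart(y ~ x1 + x2, data = train, method = "anova")

# cptable has one row per candidate subtree; "xerror" is the cross-validated
# error relative to a tree with no splits, so 1 - xerror acts as a
# cross-validated analogue of R-squared
cp_table <- fit$cptable
print(cp_table)
cv_r2 <- 1 - min(cp_table[, "xerror"])

# Hold-out accuracy: RMSE, and R-squared relative to the test-set mean
pred    <- predict(fit, newdata = test)
rmse    <- sqrt(mean((test$y - pred)^2))
test_r2 <- 1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)

print(c(cv_r2 = cv_r2, rmse = rmse, test_r2 = test_r2))
```

These numbers describe how well the tree predicts. They say nothing about whether the particular splits would reappear in a new sample.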
>
> Frank
>
> --
> Frank E Harrell Jr Professor and Chairman School of Medicine
> Department of Biostatistics Vanderbilt University
>
> ______________________________________________
> R-help at r-project.org mailing list
>
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.