[R] sample size > 20K? Was: fitness of regression tree: how to measure???

Mon Apr 5 15:51:03 CEST 2010

Just to follow up on Bert's and Frank's excellent comments.  I continue
to be amazed by people trying to interpret a single tree.  Besides the
variability in the tree structure (try bootstrapping and see how the
trees change), it is difficult to make sense of splits more than a few
levels down (high-order interactions).
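
The bootstrapping experiment suggested above can be sketched roughly as
follows. This is a minimal illustration only, written in plain Python rather
than R, and it uses a hand-rolled one-split regression "stump" instead of
rpart. The simulated data (two strongly correlated predictors), the function
names, the sample size of 200, and the seed are all assumptions made for the
sake of the example.

```python
import random
from collections import Counter

def best_split(rows):
    """Return (feature, threshold, left_mean, right_mean) minimizing SSE.

    rows is a list of (x_tuple, y) pairs. This is a one-level stump,
    a stand-in for the root split of a regression tree.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0][0])):
        ordered = sorted(rows, key=lambda r: r[0][j])
        ys = [y for _, y in ordered]
        total, total_sq = sum(ys), sum(y * y for y in ys)
        left_sum = left_sq = 0.0
        for i in range(1, n):
            left_sum += ys[i - 1]
            left_sq += ys[i - 1] ** 2
            lo, hi = ordered[i - 1][0][j], ordered[i][0][j]
            if lo == hi:
                continue  # cannot split between tied values
            right_sum, right_sq = total - left_sum, total_sq - left_sq
            sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
            if best is None or sse < best[0]:
                best = (sse, j, (lo + hi) / 2, left_sum / i, right_sum / (n - i))
    return best[1:]

def make_data(n, rng):
    """Simulate y from two nearly interchangeable predictors (assumed model)."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = x1 + rng.gauss(0, 0.3)  # x2 nearly duplicates x1
        y = x1 + x2 + rng.gauss(0, 1)
        rows.append(((x1, x2), y))
    return rows

def bootstrap_root_splits(rows, n_boot, rng):
    """Refit the stump on n_boot bootstrap resamples; collect root splits."""
    splits = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]  # resample with replacement
        j, thr, _, _ = best_split(sample)
        splits.append((j, thr))
    return splits

rng = random.Random(2010)
data = make_data(200, rng)
splits = bootstrap_root_splits(data, 50, rng)
counts = Counter(j for j, _ in splits)
print("root split variable counts:", dict(counts))
print("threshold range:", min(t for _, t in splits), max(t for _, t in splits))
```

If the chosen variable or the threshold moves around across resamples at
the root, deeper splits can only be less stable, since each depends on the
splits above it.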

Also, even a large sample size won't make up for poorly sampled data.
From Wikipedia's entry on the Kinsey Reports: "In 1948, the same year as
the original publication, a committee of the American Statistical
Association, including notable statisticians such as John Tukey,
condemned the sampling procedure. Tukey was perhaps the most vocal
critic, saying, 'A random selection of three people would have been
better than a group of 300 chosen by Mr. Kinsey.'"

Andy

From: Frank E Harrell Jr
> 
> Good comments Bert.  Just 2 points to add: People rely a lot on the tree
> structure found by recursive partitioning, so the structure needs to be
> stable.  This requires a huge sample size.  Second, recursive
> partitioning is not competitive with other methods in terms of
> predictive discrimination unless the sample size is so large that the
> tree doesn't need to be pruned upon cross-validation.
> 
> Frank
> 
> 
> Bert Gunter wrote:
> > Since Frank has made this somewhat cryptic remark (sample size > 20K)
> > several times now, perhaps I can add a few words of (what I hope is)
> > further clarification.
> > 
> > Despite any claims to the contrary, **all** statistical (i.e. empirical)
> > modeling procedures are just data interpolators: that is, all that they
> > can claim to do is produce reasonable predictions of what may be expected
> > within the extent of the data. The quality of the model is judged by the
> > goodness of fit/prediction over this extent. Ergo the standard textbook
> > caveats about the dangers of extrapolation when using fitted models for
> > prediction. Note, btw, the contrast to "mechanistic" models, which
> > typically **are** assessed by how well they **extrapolate** beyond current
> > data. For example, Newton's apple to the planets. They are often
> > "validated" by their ability to "work" in circumstances (or scales) much
> > different than those from which they were derived.
> > 
> > So statistical models are just fancy "prediction engines." In particular,
> > there is no guarantee that they provide any meaningful assessment of
> > variable importance: how predictors causally relate to the response.
> > Obviously, empirical modeling can often be useful for this purpose,
> > especially in well-designed studies and experiments, but there's no
> > guarantee: it's an "accidental" byproduct of effective prediction.
> > 
> > This is particularly true for happenstance (un-designed) data and
> > non-parametric models like regression/classification trees. Typically,
> > there are many alternative models (trees) that give essentially the same
> > quality of prediction. You can see this empirically by removing a modest
> > random subset of the data and re-fitting. You should not be surprised to
> > see the fitted model -- the tree topology -- change quite radically.
> > HOWEVER, the predictions of the models within the extent of the data will
> > be quite similar to the original results. Frank's point is that unless
> > the data set is quite large and the predictive relationships quite strong
> > -- which usually implies parsimony -- this is exactly what one should
> > expect. Thus it is critical not to over-interpret the particular model
> > one gets, i.e. to infer causality from the model (tree) structure.
> > 
> > Incidentally, there is nothing new or radical in this; indeed, John
> > Tukey, Leo Breiman, George Box, and others wrote eloquently about this
> > decades ago. And Breiman's random forest modeling procedure explicitly
> > abandoned efforts to build simply interpretable models (from which one
> > might infer causality) in favor of building better interpolators,
> > although assessment of "variable importance" does try to recover some of
> > that interpretability (however, no guarantees are given).
> > 
> > HTH. And contrary views welcome, as always.
> > 
> > Cheers to all,
> > 
> > Bert Gunter
> > Genentech Nonclinical Biostatistics
> >  
> >  
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> > On Behalf Of Frank E Harrell Jr
> > Sent: Thursday, April 01, 2010 5:02 AM
> > To: vibha patel
> > Cc: r-help at r-project.org
> > Subject: Re: [R] fitness of regression tree: how to measure???
> > 
> > vibha patel wrote:
> >> Hello,
> >>
> >> I'm using the rpart function to create regression trees.
> >> Now, how do I measure the fitness of a regression tree?
> >>
> >> Thanks and regards,
> >> Vibha
> > 
> > If the sample size is less than 20,000, assume that the tree is a
> > somewhat arbitrary representation of the relationships in the data and
> > that the form of the tree will not replicate in future datasets.
> > 
> > Frank
> > 
> 
> 
> -- 
> Frank E Harrell Jr
> Professor and Chairman, Department of Biostatistics
> School of Medicine, Vanderbilt University
> 