[R] rpart minimum sample size
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Tue Feb 27 17:08:51 CET 2007
Amy Uhrin wrote:
> Is there an optimal / minimum sample size for attempting to construct a
> classification tree using /rpart/?
>
> I have 27 seagrass disturbance sites (boat groundings) that have been
> monitored for a number of years. The monitoring protocol for each site
> is identical. From the monitoring data, I am able to determine the
> level of recovery that each site has experienced. Recovery is our
> categorical dependent variable with values of none, low, medium, high
> which are based upon percent seagrass regrowth into the injury over
> time. I wish to be able to predict the level of recovery of future
> vessel grounding sites based upon a number of categorical / continuous
> predictor variables used here including (but not limited to) such
> parameters as: sediment grain size, wave exposure, original size
> (volume) of the injury, injury age, injury location.
>
> When I run /rpart/, the data is split into only two terminal nodes based
> solely upon values of the original volume of each injury. No other
> predictor variables are considered, even though I have included about
> six of them in the model. When I remove volume from the model the same
> thing happens but with injury area - two terminal nodes are formed based
> upon area values and no other variables appear. I was hoping that this
> was a programming issue, me being a newbie and all, but I really think
> I've got the code right. Now I am beginning to wonder if my N is too
> small for this method?
>
In my experience, N needs to be around 20,000 to get both good accuracy
and replicable patterns unless the number of potential predictors is
tiny. By that standard, 27 sites is far too few for a single tree. In
general, the R^2 from rpart is not competitive with that from an
intelligently fitted regression model. This is simply a difficult
problem when relying on a single tree (hence the popularity of random
forests, bagging, and boosting).
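
The replicability point can be illustrated with a rough sketch. This is
not rpart and not anyone's actual analysis: it is a standard-library
Python toy that searches for the single best Gini split, and the data
are simulated (27 sites, 6 numeric predictors, a four-level recovery
class that depends only weakly on predictor 0). The function names and
the simulated data are assumptions made for illustration. The
min_bucket value of 7 is chosen to match rpart's documented default of
minbucket = round(minsplit/3) with minsplit = 20.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split_variable(rows, labels, min_bucket=7):
    """Return the index of the predictor whose best binary split gives the
    lowest weighted Gini impurity, or None if no split that leaves at least
    min_bucket observations on each side improves on the unsplit node."""
    n = len(rows)
    best_var, best_score = None, gini(labels)
    for j in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][j])
        for cut in range(min_bucket, n - min_bucket + 1):
            # A split cannot separate observations with identical values.
            if rows[order[cut - 1]][j] == rows[order[cut]][j]:
                continue
            left = [labels[i] for i in order[:cut]]
            right = [labels[i] for i in order[cut:]]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_var, best_score = j, score
    return best_var


def simulate(n_sites=27, n_predictors=6, n_boot=200, seed=1):
    """Count which predictor wins the first split across bootstrap resamples."""
    rng = random.Random(seed)
    levels = ["none", "low", "medium", "high"]
    # Made-up data: predictor 0 weakly drives recovery, the rest are noise.
    rows = [[rng.gauss(0, 1) for _ in range(n_predictors)]
            for _ in range(n_sites)]
    labels = []
    for r in rows:
        s = r[0] + rng.gauss(0, 1)
        k = 0 if s < -0.7 else 1 if s < 0 else 2 if s < 0.7 else 3
        labels.append(levels[k])
    chosen = Counter()
    for _ in range(n_boot):
        # Resample the sites with replacement and refit the first split.
        pick = [rng.randrange(n_sites) for _ in range(n_sites)]
        var = best_split_variable([rows[i] for i in pick],
                                  [labels[i] for i in pick])
        chosen[var] += 1
    return chosen


counts = simulate()
for var, k in sorted(counts.items(), key=lambda kv: -kv[1]):
    print("predictor", var, "chosen in", k, "of 200 resamples")
```

If the first split lands on different predictors in different
resamples, the structure of any single tree grown from those 27 sites
should not be read as a stable finding. How often that happens depends
on the simulated data and the seed.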
Frank
--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University