Christoph Muller Christoph.Muller at mnp.nl
Wed Sep 5 18:34:22 CEST 2007

Hi, everyone,

I haven't found anything similar in the forum, so here's my problem (I'm no
expert in either R or statistics):

I have a data set of 59,000 cases with 9 variables each (fractional
coverage of 9 different plant types, such as deciduous broad-leaved
temperate trees or evergreen tropical trees etc.), which was generated by a
vegetation model.
In order to evaluate the quality of the vegetation model's output, I want
to compare it to a land-cover data set which has 23 different land-cover
types (such as needle-leaved evergreen forest, dense broad-leaved forest,
barren, etc.).
A statistician advised me to use the randomForest package in R. Using a
subset to generate the random forest, I get a very good prediction for the
However, I need to evaluate how meaningful this classification is in an
ecological sense (boreal trees should not play a role in the definition of
tropical land-cover types, for example), otherwise I cannot judge the
quality of the vegetation model's output.
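
The workflow described above could look roughly like the following in R.
This is only a sketch on synthetic stand-in data, not my real data set: the
columns pft1 to pft3, the three class labels and the subset size are made up
for illustration. The class-specific columns of importance() give a first
hint as to which plant types drive the prediction of which land-cover class.

```r
library(randomForest)

set.seed(1)

# Synthetic stand-in data (assumption): 300 cells, 3 hypothetical plant types
n <- 300
veg <- data.frame(pft1 = runif(n), pft2 = runif(n), pft3 = runif(n))

# Hypothetical land-cover labels derived from simple thresholds
landcover <- factor(ifelse(veg$pft1 > 0.5, "forest",
                    ifelse(veg$pft2 > 0.5, "grass", "barren")))

# Grow the forest on a subset, keep the rest for evaluation
train <- sample(n, 200)
rf <- randomForest(x = veg[train, ], y = landcover[train],
                   ntree = 200, importance = TRUE)

# Predict the held-out cases and tabulate observed vs. predicted classes
pred <- predict(rf, veg[-train, ])
conf_mat <- table(observed = landcover[-train], predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Permutation importance per class: rows are plant types, columns are classes
class_imp <- importance(rf, scale = FALSE)[, levels(landcover)]
print(conf_mat)
print(class_imp)
```

In this toy example one would expect pft3 to have low importance for every
class, since it plays no role in how the labels were constructed.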

Unfortunately, randomForest gives me about 15,000 splits, of which about
5,000 are end branches (rough guess), so it's very hard and time-consuming
to check each single branch of one of the final trees for its ecological
meaning.
Is there any utility to summarize the characteristics of each of the 23
prediction classes? Such as "land-cover class 1 has less than 5% of plant
types 1-5, 20-50% of plant type 7 and at least 30% of plant type 8".
Or is there a more suitable method to classify my data?
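
To illustrate the kind of per-class summary I have in mind, here is a
minimal sketch in base R. It does not inspect the trees at all; it simply
reports, for each predicted class, the 5%, 50% and 95% quantiles of every
plant-type fraction. The data, the column names pft1 to pft3 and the class
labels lc1 to lc3 are hypothetical stand-ins.

```r
set.seed(42)

# Synthetic stand-in data (assumption): 600 cells, 3 hypothetical plant types
n <- 600
veg <- data.frame(pft1 = runif(n), pft2 = runif(n), pft3 = runif(n))

# Hypothetical predicted classes; in practice these would come from predict()
predicted_class <- factor(ifelse(veg$pft1 > 0.6, "lc1",
                          ifelse(veg$pft2 > 0.5, "lc2", "lc3")))

# For one class: a matrix with one row per plant type, one column per quantile
summarise_class <- function(x) {
  t(apply(x, 2, quantile, probs = c(0.05, 0.5, 0.95)))
}

# One summary matrix per predicted class
profiles <- lapply(split(veg, predicted_class), summarise_class)
print(lapply(profiles, round, 2))
```

Each matrix can then be read as a statement of the form "class lc1 has between
x% and y% of plant type 1", which is easier to judge ecologically than
individual branches.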

Thanks a lot in advance!


