[R] random forest and vegetation data

Liaw, Andy andy_liaw at merck.com
Fri Feb 1 20:48:04 CET 2008


First things first:  There's no "final tree" in random forests.  You get
a set of trees (i.e., a forest).  Secondly, a forest "cannot be
interpreted" because of the complexity, not because the splits can't
possibly make sense.  You can try to interpret the trees, as long as you
understand the potential pitfalls of doing that.
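
To make the "set of trees" point concrete, here is a minimal sketch that pulls individual trees out of a fitted forest with getTree() from the randomForest package. The built-in iris data, the seed, and ntree = 25 are illustrative assumptions only and have nothing to do with the vegetation data discussed in this thread.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so the illustration is repeatable

# Illustrative fit on a built-in data set (an assumption, not the poster's data).
rf <- randomForest(Species ~ ., data = iris, ntree = 25)

# The fitted object holds ntree separate trees; there is no single merged tree.
rf$ntree

# Extract the 1st and the 7th tree. Each row is a node, with its split
# variable, split point, and (for terminal nodes) the predicted class.
tree1 <- getTree(rf, k = 1, labelVar = TRUE)
tree7 <- getTree(rf, k = 7, labelVar = TRUE)

head(tree1)
head(tree7)
```

Any one of these trees can be read like an ordinary classification tree, but the forest's prediction is the vote over all of them, so the splits of a single tree need not reflect what the forest as a whole does.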

Here's how ordered factors are handled in randomForest.  Since the tree
algorithm only makes use of the ranks, there's basically no difference
between numerical and ordinal variables.  Thus ordered factors are
simply treated as integers 1 through K, where K is the number of levels,
and the underlying algorithm is told that this is just a numeric
variable.  This is how a split point of something like 3.5 appears.
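
The coding described above can be checked in base R. This is only a sketch: the cover values are made up, and it assumes, as is usual for CART-style trees, that a numeric split point is placed midway between two adjacent observed values.

```r
# Hypothetical cover classes for one species, with possible levels 0-9.
cover <- factor(c(0, 2, 3, 0, 2, 3), levels = 0:9, ordered = TRUE)

K <- nlevels(cover)         # number of levels, here 10
codes <- as.integer(cover)  # integer codes 1..K: label "0" -> 1, "2" -> 3, "3" -> 4

# Candidate split points, assumed to lie midway between adjacent observed codes.
vals <- sort(unique(codes))
midpoints <- (head(vals, -1) + tail(vals, -1)) / 2
print(midpoints)

# Translate a split point on the code scale back to the level labels.
split_point <- 3.5
left_levels  <- levels(cover)[seq_len(floor(split_point))]
right_levels <- levels(cover)[(floor(split_point) + 1):K]
print(left_levels)   # labels sent to one side of the split
print(right_levels)  # labels sent to the other side
```

Note that when the level labels start at 0, the integer codes are shifted by one relative to the labels, so a split point of 3.5 separates cover classes 0-2 from classes 3-9, not classes 0-3 from 4-9.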

Hope that makes some sense.

Best,
Andy 

From: ahelmore at umd.edu
> 
> Hi there,
> 
> I am an environmental studies masters student trying to get 
> my thesis out the door.  I am also a newbie at trees in 
> general, but I like what I see in the literature about the 
> random forest algorithm.  I think I get the general gist of 
> things, but even after reading stuff I'm unclear about how I 
> could be getting the results I'm seeing.  I obviously am 
> missing something about how the split points in the final 
> tree are decided.
> 
> I've been using random forests in image classification by 
> entering split values into decision tree classifiers, and 
> that has seemed to work very well.  The map output appears 
> legitimate and withheld data gives confusion matrices similar 
> to the predictive errors from the random forest.  This leads 
> me to assume that the split points are effective.
> 
> However now that I've turned to the ecological portion of my 
> analysis, with a data set that contains few variable levels 
> and lots of zeros, suddenly the splitting node information is 
> not making sense.
> 
> Here is my situation.  I have a matrix of study plots that 
> each belong to one of three elevation classes and which each 
> have percent cover class data for 15 plant species associated 
> with them.  
> 
> plot	elev	sp1	sp2	sp3... sp15
> 1	3	0	2	6...      5
> 2	0	0	0	1...      0
> etc.
> 
> The species data are ordered factors from 0-9.  When I run 
> the algorithm using species cover values to predict elevation 
> class, two species alone come up as the best predictors.  
> That makes ecological sense in this setting, given the 
> species ranges in question.
> 
> Here's my difficulty though.  The split point values can't be 
> interpreted, as far as I can tell.  I'm getting split points 
> of, say, 1.5 and 2.5 for a species whose cover is either 0 
> (absent) or 4 and above.  So obviously the split points in 
> the final tree are being generated in some way I don't 
> understand.  Averaged?  
> 
> I've tried running the tree using the data as factors, using 
> the data as ordered factors, and using the data as numerical 
> variables, just to see if I could gain insight into what's 
> going on, but I'm coming up clueless.  My literature hunt 
> reveals repeated instances of folks saying that the final 
> tree can't be interpreted the way other trees are, but I'm 
> not getting a lot on just why that might be.  
> 
> Some folks talk about the final tree being "averaged," others 
> say that "mode" is employed (which doesn't make sense to me 
> if I'm getting 1.5 and 2.5 split values).  If the trees are 
> only good as black box predictors (which is of course a very 
> useful thing in itself), should I even be using the node 
> information in my image classifications?  
> 
> As you see, I'm missing some rather important point or other 
> here.  Can you enlighten?
> 
> Thanks,
> A

